System and method for sound recognition with feature selection synchronized to voice pitch

ABSTRACT

Speech signals are transformed into phoneme identification signals which represent the first two positive slope durations following the maximum peak amplitude of a pitch cycle.

BACKGROUND OF THE INVENTION

The invention relates to systems and methods for speaker-independent continuous and connected speech recognition and characteristic sound recognition, and more particularly to systems and methods for dealing with both rapid and slow transitions between phonemes and characteristic sounds, and for dealing with silence and distinguishing between certain closely related phonemes and characteristic sounds, and for processing the phonemic recognition in real time.

In recent years there has been a great deal of research in the area of voice recognition because there are numerous potential applications for a reliable, low-cost voice recognition system. The following references are illustrative of the state of the art:

    ______________________________________
    U.S. Pat. No. 3278685   Harper      1966   IBM
    U.S. Pat. No. 3416080   Wright      1966   SEC (UK)
    U.S. Pat. No. 3479460   Clapper     1969   IBM
    U.S. Pat. No. 3485951   Hooper      1969   Private
    U.S. Pat. No. 3488446   Miller      1970   Bell
    U.S. Pat. No. 3499989   Cotterman   1970   IBM
    U.S. Pat. No. 3499990   Clapper     1970   IBM
    U.S. Pat. No. 3573612   Scarr       1968   STC
    U.S. Pat. No. 3603738   Fecht       1971   Philco
    U.S. Pat. No. 3617636   Ogihara     1971   NEC (Japan)
    U.S. Pat. No. 3646576   Griggs      1970   Private
    U.S. Pat. No. 374214    Newman      1970   NRDC (UK)
    U.S. Pat. No. 3770892   Clapper     1972   IBM
    U.S. Pat. No. 3916105   McCray      1975   IBM
    U.S. Pat. No. 3946157   Dreyfus     1976   Private
    U.S. Pat. No. 4343969   Kellet      1982   Trans-Data Assoc
    ______________________________________

As mentioned in my prior U.S. Pat. No. 4,284,846, many problems still remain to be solved in speech recognition. These include the normalization of speech signals to compensate for amplitude and pitch variations in speech by different persons, the obtaining of a reliable and efficient parametric representation of speech signals for processing by digital computers, identifying and utilizing the demarcation points between adjacent phonemes, identifying the onset of each voiced pitch cycle, identifying very short duration phonemes, and ensuring that the speech recognition system can adapt to different speakers or new vocabularies.

The system described as a preferred embodiment in my previous U.S. Pat. No. 4,284,846 represents a major step forward in the evolution of speech and sound recognition systems, in that it shows that a system with very little hardware, including a single, low-cost integrated circuit microprocessor, can achieve real time recognition of spoken phonemes. Furthermore, the system described in that patent is relatively speaker-independent.

However, my subsequent research has shown that the system described in U.S. Pat. No. 4,284,846 requires more software than was originally expected for dealing with gaps of "silence" between phonemes in some ordinary speech. My subsequent research has also shown that more clues are necessary to reliably distinguish between certain closely related phonemes than is indicated in my U.S. Pat. No. 4,284,846. Furthermore, my subsequent research has shown that the considerable variation in the pitch of any typical person's normal speaking voice, and the effect upon the speech waveform of the configuration and position of the speaker's various "articulators", such as the size and shape of the mouth cavity, the size and shape of the nasal cavity, the size and shape and position of the tongue, the size and position of the teeth, and the size and position of the lips of the speaker, cause, in some cases, inaccuracies in the "characteristic ratios" described in my U.S. Pat. No. 4,284,846. This makes it more difficult to achieve a comprehensive, completely inclusive, speaker-independent phoneme recognition system than I previously thought to be the case.

Accordingly, there still remains an unmet need for a less expensive, more accurate, more reliable, more speaker-independent, and more pitch-independent voice recognition system than is possibly achievable by any device or system or method disclosed in the prior art, other than my own prior patents, presently known to me.

In the area of speaker-dependent voice command recognition systems there are a number of devices presently available. They are capable of receiving, for example, simple word commands and producing corresponding digital command codes which are transmitted to a computer. Typically, such voice command systems must be "trained" to recognize particular command words spoken by a particular speaker. It should be appreciated that an average person cannot speak the same word in exactly the same way twice. In fact, there is a great variation in the speech waveforms produced when an average person tries to speak the same word a number of times. Present speaker-dependent voice command recognition systems are not capable of storing digitized speech waveform data for only one utterance of a particular command, and then later reliably recognizing the same word spoken by the same speaker. Therefore, the presently available systems are "trained" by instructing the speaker to speak the desired word into the system's microphone a number of times. The microphone signal for each repetition of the word is amplified and digitized, typically by using zero-crossing techniques, and sometimes by using analog to digital converters and processing the resulting digital output. Some of the available systems compare each stored version of that word with the digitized version of a later spoken utterance of that command word to try to match the spoken command with one of the stored versions of it. Various auto-correlation operations are performed to determine if there is a match. Other systems use various techniques to average the digitized data of the numerous utterances of the same command word received during the "training" session, and then compare a later spoken command with the stored averaged data in attempting to recognize the spoken command. Such prior systems are slow, expensive, and unreliable, and are not yet widely used.

SUMMARY OF THE INVENTION

Accordingly, it is an object of the invention to provide a means of accurately representing the phonemic information contained primarily in the positive pressure wavefronts of a speech waveform in simple x, y coordinate pairs to form a phoneme map.

It is another object of the invention to provide an accurate phoneme or characteristic sound recognizing system that uses a simple digitizing circuit for converting an analog speech waveform to a binary waveform in which the major inflection points of the speech waveform correspond to the transition edges of the binary waveform.

It is another object of the invention to provide a method and apparatus for extracting the major phonemic information or characteristic sound information from selected time windows representing the smallest duration portions of each sound waveform pitch cycle that contains such phonemic information or characteristic sound information.

It is another object of the invention to provide a phoneme recognition system or characteristic sound recognition system which operates in a fashion that corresponds to and is consistent with the internal workings of the human inner ear cochlear mechanics.

It is another object of the invention to provide an effective and simple way of dealing with silent intervals in speech so as to minimize circuit and/or digital processor activity in response to background "white noise" pulses contained in such silent intervals.

It is another object of the invention to provide an accurate, simple device for obtaining starting and ending times of phonemes, and for demarcating silent intervals between phonemes, and for storing such times or intervals.

It is another object of the invention to provide a means of compensating for modulating shifts produced in the characteristic relationships between time durations of certain major points of inflection of the speech waveform caused by inharmonious differences between the frequency of the pitch cycle being produced by a speaker's vocal cords and the resonant frequencies of the articulating portion of the speaker's mouth.

It is another object of the invention to provide a means of producing and using signals representing fast-moving trajectories of two key parameters on a phoneme map to provide "clue signals" that distinguish between closely related short duration consonant phonemes.

It is another object of the invention to provide a method and apparatus for reducing the complexity of the analog waveform produced by a microphone receiving continuous, connected-word speech signals to allow circuitry of minimum complexity to produce a binary representation of the speech signals, which binary representation can be processed in real time by a single chip microcomputer to accurately, reliably, and independently of the speaker and pitch of his voice, produce phonemic identification codes or symbols that can be displayed or printed or otherwise used in electromechanical or electronic apparatus which may benefit from man-to-machine spoken communication using accurately identified speech phonemic elements, to accurately represent human speech.

It is another object of the invention to provide an improved method and system for "compact" digital representation of speech and sound signals which digitally represent the most significant portions of a speech waveform.

It is another object of the invention to provide an improved, low cost, reliable, low complexity method and apparatus for accomplishing speaker-dependent voice command recognition.

It is another object of the invention to provide an improved, low cost, reliable, low complexity device for accomplishing recognition of spoken utterances by comparison of features of the spoken utterances with utterance features previously stored by speaking the same utterances into the system.

Briefly described, and in accordance with one embodiment thereof, the invention provides a system and method for producing a binary signal having a "1" level during positive pressure wave portions of a sound signal and a "0" level during negative pressure wave portions of the sound signal, detecting the time point of the major peak positive and negative excursions of each pitch cycle of the sound signal and producing a corresponding pitch cycle marker signal that occurs substantially at the beginning of each pitch cycle, producing a first number that represents the duration of a "1" level of the binary signal most closely following a pulse of the pitch cycle signal and producing a second number that represents the duration of the following "1" level of the binary signal, composing a vector from the first and second numbers, comparing that vector with a plurality of stored vector domains to determine if the present vector falls within any of the stored vector domains, and producing a character signal or a phoneme identifying signal or code representing a phoneme or sound corresponding to a one of the stored vectors that most nearly matches the present input vector.
In the described embodiment of the invention, the system also produces a first "running average" of the durations of the "1" levels and a second "running average" of the durations of the "0" levels for a plurality of the pitch cycles of the binary signals and also produces a summation of the durations of the "1" and "0" levels of each and every pitch cycle, divides that summation by the sum of the running averages of the "1" and "0" intervals, and, if there is a remainder, uses that remainder to obtain a correction factor that is added to or multiplied by the input vector formed from the successive "1" levels to compensate for inaccuracies in the input vector caused by the influence of mismatches between the pitch cycle of the sound presently uttered by a speaker and the resonant frequency of the present configuration of the mouth cavity of the speaker. In the described embodiment of the invention, the change or "velocity" of the durations of the "1" levels is computed on a real time basis, as is the rate of change, or "acceleration" of the "1" levels. On the basis of these accelerations and/or velocities, which represent the velocities and/or accelerations of various parts of the speaker's articulation apparatus, character demarcation signals or phoneme demarcation signals or codes that represent the beginning or end of a character or phoneme represented by the present sound signal are produced.

Concurrently, fricative vectors are computed between the "1" and "0" running averages and are compared with a plurality of stored fricative vector domains to determine the best match of the stored fricative domains with the presently computed fricative vector. A fricative phoneme signal corresponding to a one of the stored fricative vector domains that best matches the presently computed fricative vector is produced if there is such a matching, indicating non-voiced fricative phonemes or fricative content in a voiced phoneme. Intervals of silence, indicated by quite small durations of sporadic "1" intervals, are assigned a "0" level. If the duration of a silent interval falls into a predetermined range that is compared to a table of stored durations, each of which corresponds to the onset of a different plosive phoneme, then the identified interval is used to help identify the plosive phoneme corresponding to the matched stored silence duration. All of the foregoing information is utilized to generate various phoneme code signals that can be used to drive a printer, display a sequence of phoneme characters that accurately represent the incoming speech or other characteristic sounds, or provide various pre-arranged activities in a wide variety of machines or apparatus which can benefit from the guidance of speech and sound signals.
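The running averages and the fricative vector described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the text does not say how the "running average" is computed, so an exponential running average with an arbitrary weight is assumed here, and the names `RunningAverage` and `fricative_vector` are hypothetical.

```python
class RunningAverage:
    """Exponential running average (an assumed form; the weight is arbitrary)."""

    def __init__(self, weight=0.25):
        self.weight = weight
        self.value = None  # no samples seen yet

    def update(self, sample):
        if self.value is None:
            self.value = float(sample)  # first sample seeds the average
        else:
            self.value += self.weight * (sample - self.value)
        return self.value


def fricative_vector(durations_1, durations_0, weight=0.25):
    """Pair the running average of "1" durations with that of "0" durations."""
    avg_1 = RunningAverage(weight)
    avg_0 = RunningAverage(weight)
    for duration in durations_1:
        avg_1.update(duration)
    for duration in durations_0:
        avg_0.update(duration)
    return (avg_1.value, avg_0.value)
```

The resulting pair would then be compared with stored fricative vector domains; the contents of those domains are not given in this section and are not assumed here.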

In another embodiment of the invention, an apparatus and method are disclosed for providing a compact digitizing of a speech or sound waveform, by producing the above first and second running averages of the durations of the "1" and "0" levels, respectively, sampling the first and second running averages at predetermined intervals, and operating on the sampled data to identify and produce event data for significant events in the speech or sound waveform, including rapidly rising and rapidly falling values of the first running average and "silence" values of the second running average, to produce a very compact digital representation of the meaning of the speech or sound waveform. In a described embodiment of the invention, this event data is arranged in "windows" of significant events of the speech waveform, and the event data in each window is compared to previously stored event data which has been similarly obtained by speaking selected words into the apparatus and producing and storing significant data. Matches of significant event data in the window to stored reference event data are identified, and the procedure is repeated for event data in another window that begins immediately after a matching is detected, until a predetermined number of consecutive failures to match an event in a particular window to a reference event occurs, or until all events of the utterance have been tested against reference events. More specifically, event data for each significant event of a first window is compared to reference event data for a first reference event, and any matching is identified, in accordance with weighting criteria, and a score representing the degree of mismatching is accumulated. Immediately after any matching is detected, a new window of subsequent significant events of the speech or sound waveform is compared to a subsequent reference event. A higher degree of recognition of voiced commands is achieved at a lower degree of complexity than has been previously achieved.
In this embodiment of the invention, a microphone, amplifying and filter circuitry, inflection point detecting circuitry, and a microprocessor are included in a circuit to produce the sampled values of the first and second running averages. The operations on the sampled values of the first and second running averages are performed by a desk top computer coupled by a cable to the unit, and the instructions for conducting these operations are contained in a program loaded into the desk top computer.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an embodiment of the invention.

FIG. 2 is a diagram showing a portion of a speech waveform and a binary waveform corresponding to major inflection points of the speech waveform.

FIG. 3 is a diagram of a decision tree in which input vectors computed in real time, silence intervals, and swoop trajectory directions computed in real time are compared to entries in a stored vector map or lookup table to identify the most recently uttered phoneme.

FIG. 4 is a circuit schematic diagram of the microphone signal amplifier of FIG. 1.

FIG. 5 is a circuit schematic diagram of the audio band pass amplifier of FIG. 1.

FIG. 6 is a circuit schematic diagram of the pitch band pass amplifier of FIG. 1.

FIG. 7 is a circuit schematic diagram of the major peak detector circuit in FIG. 1.

FIG. 8 is a diagram showing several cycles of a speech waveform, and is useful in describing the operation of the circuit of FIG. 7.

FIG. 9 is a circuit schematic diagram of the inflection point detector circuit of FIG. 1.

FIG. 10A is a diagram including two waveforms useful in describing the operation of the inflection point detector circuit of FIG. 9.

FIG. 10B is a diagram showing two waveforms that are also useful in describing the operation of the circuit of FIG. 9.

FIG. 11 is a circuit schematic diagram of the threshold limiting circuit of FIG. 1.

FIG. 12 is a circuit schematic diagram of the pulse shaper and latch circuit of FIG. 1.

FIGS. 13A and 13B constitute a flow chart of a program executed by microcomputer 10 of FIG. 1 in accordance with the phoneme recognition method of the present invention.

FIG. 14 is a flow chart of a program executed by the microcomputer 10 of FIG. 1 to process binary data produced by the inflection point detector circuit of FIG. 1.

FIG. 15 is a flow chart of a subroutine executed by the microcomputer 10 of FIG. 1 to service an interrupt request produced by the shaper and latch circuit of FIG. 1.

FIG. 16 is a flow chart of a subroutine executed by microcomputer 10 of FIG. 1 to compute updated values of certain variables used in the execution of the program of FIGS. 13A and 13B.

FIG. 17 is a generalized fricative vector domain map useful for identifying an unknown fricative phoneme.

FIG. 18 is a generalized voiced phoneme vector domain map useful in identifying an unknown voiced phoneme.

FIG. 19 is a diagram illustrating another embodiment of the invention, which provides a speaker-dependent voiced command recognition system.

FIG. 20 is a graph of two waveforms that are useful in describing the operation of the device of FIG. 19.

FIG. 21 is a memory map useful in describing the operation of the device of FIG. 19.

FIG. 22 is a diagram useful in explaining comparison of event data derived from a voiced command spoken into the device of FIG. 19 with stored reference event data previously spoken into the device of FIG. 19 during a "training" session.

FIG. 23 is a flow chart of a program executed by a computer to effectuate the event comparison procedure illustrated in FIG. 22.

FIG. 24 is a block diagram of an embodiment of the invention with a higher degree of circuit integration than the embodiment of FIG. 1.

DESCRIPTION OF THE INVENTION

Referring now to FIG. 1, a speech analyzer circuit 1 includes a microphone 2 which receives audible speech sounds and produces an electrical signal that is applied to the input of a microphone signal amplifier circuit 3. In accordance with the present invention, speech analyzer circuit 1 produces electrical signals on bus 21 that identify speech phonemes and other audible sound-representing electrical signals. For example, the electrical signals produced on bus 21 can be ASCII signals or the like for causing phonemic symbols or other characters to be printed or displayed on a suitable screen. Or, the signals on conductor 21 can be utilized to control a variety of other kinds of electromechanical apparatus, such as desk-top computers, automobiles, robots and the like, in response to voice commands. The signals on conductor 21 can also be utilized to operate apparatus such as devices for aiding speech-impaired persons, to operate phonetic typewriters, and can find many other applications in the general field of speech-to-machinery communication.

Note that the electrical signals on bus 21 do not, however, represent speech that has been recognized in the semantic sense, nor do the signals on bus 21 represent the correct spelling used to represent sounds and words in various human languages. The electrical signals on bus 21 only represent characteristic sounds, such as phonemes which are generally used in all spoken languages.

The output produced by amplifier 3 is applied by conductor 4 to the input of an audio band pass filter amplifier 5, which has a center frequency of approximately 550 hertz and passes a band of frequencies adequate for phoneme recognition. The output of filter amplifier 5 is produced on conductor 7 and applied to the input of an inflection point detector circuit 8, which performs the function of generating a binary output signal on conductor 9. The positive and negative transitions of the binary signal on conductor 9 occur at precisely the times of occurrence of major inflections of the input signal on conductor 7. For purposes of this disclosure, a major inflection is one that occurs when the positive-going or negative-going waveform on conductor 7 endures for at least approximately 50 microseconds before the next point of inflection arrives and the slope of the waveform reverses direction. The binary waveform on conductor 9 has the appearance of the signal 44C in FIG. 2, which is a duplicate of speech waveform 33 and binary signal 44C in FIG. 7 of my prior U.S. Pat. No. 4,284,846 entitled "SYSTEM AND METHOD FOR SOUND RECOGNITION", issued Aug. 18, 1981 and entirely incorporated herein by reference.
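The definition of a major inflection given above can be illustrated in software. The following Python sketch is not the circuit of FIG. 9; it is a minimal model that assumes a uniformly sampled waveform, assumes that the binary level is "1" while the slope is positive-going and "0" while it is negative-going, and ignores any slope reversal that does not endure for the minimum duration. The function name and parameters are hypothetical.

```python
import math


def slope_binary_transitions(samples, sample_period_us, min_duration_us=50.0):
    """Return (time_us, level) transitions of a slope-sign binary waveform.

    A new slope direction is accepted only after it has persisted for
    min_duration_us; the transition is then time-stamped at the sample
    where that slope began. Flat samples are ignored.
    """
    min_run = max(1, math.ceil(min_duration_us / sample_period_us))
    level = None            # current accepted slope direction (1 up, 0 down)
    candidate = None        # slope direction awaiting confirmation
    candidate_start = 0     # sample index where the candidate slope began
    candidate_len = 0       # number of steps the candidate has persisted
    transitions = []
    for i in range(1, len(samples)):
        diff = samples[i] - samples[i - 1]
        if diff == 0:
            continue
        slope = 1 if diff > 0 else 0
        if slope == level:
            candidate = None  # the reversal was too brief; discard it
            continue
        if slope == candidate:
            candidate_len += 1
        else:
            candidate, candidate_start, candidate_len = slope, i - 1, 1
        if candidate_len >= min_run:
            level = slope
            transitions.append((candidate_start * sample_period_us, level))
            candidate = None
    return transitions
```

With a 10 microsecond sample period, a reversal must persist for five consecutive steps before it is reported, so a single-sample dip on a rising slope produces no transition.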

Inflection point detector circuit 8 has an adjustment circuit 22 which is connected to a ground conductor 23. Adjustment circuit 22 allows normalizing of the binary waveform on conductor 9 to a "standard" waveform. (As subsequently explained, the input circuitry used in inflection point detector circuit 8 has various input offset leakage currents which must be compensated for to produce a standard offset.)

The binary waveform on conductor 9 is a real time binary signal that is fed into a capture register 11 of a single chip microcomputer 10, which can be any of a variety of available devices, such as a Motorola MC68701.

The above described path from microphone 2 to capture register 11 of microcomputer 10 is one of the two input signal paths to microcomputer 10. The other signal path applies the amplified analog speech signal of conductor 4 to the input of a band pass "pitch" filter amplifier 6, the output of which is applied by conductor 12 to the input of a negative peak detector circuit 13. The band pass "pitch" filter amplifier 6 has a much lower pass band than the audio band pass filter amplifier 5, because the purpose of band pass filter 6 is to encompass the "pitch" or vocal cord frequency of average human voices.

The output signal of negative peak detector circuit 13 on conductor 14, which identifies the times of occurrence of major negative peaks of the band pass filter output signal on conductor 12, is applied to the input of a threshold limiting circuit 15. Threshold limiting circuit 15 allows the microcomputer 10 to provide a threshold adjustment voltage on conductor 16 to raise or lower the sensitivity of threshold limiting circuit 15, so that only the negative peaks on conductor 14 having the greatest amplitude are passed onto conductor 17. These maximum amplitude signals on conductor 17 are referred to as "pitch trigger" signals or pulses. The "pitch trigger" signals such as 24 are applied by conductor 17 to an input of a pulse shaper and latch circuit 18.

The purpose of circuit 18 is to greatly shorten the time of the leading edge of the trigger pulses to operate the binary latch. When the latch circuit in block 18 is set, it produces a negative edge, such as the one indicated by reference numeral 25. This negative edge 25 is interpreted as an interrupt by interrupt circuitry 27 inside microcomputer 10. After the microcomputer has interpreted and serviced the interrupt request signal on conductor 19, microcomputer 10 produces a "clear" signal 26 on conductor 20 to clear the latch circuitry in block 18.

Note that the interrupt signal on conductor 19 must remain at a low level until the microcomputer 10 has had an opportunity to interpret and act in response to the interrupt. Note also that the procedure for producing the clear signal on conductor 20 and applying it to the latch in block 18 provides a means by which microcomputer 10 can "refuse" to "look at" pitch trigger signals on conductor 17. Thus, this circuitry provides a means of increasing or lessening, under microcomputer control, the select number of pitch trigger signals that might be considered in the phoneme analyzing process of the invention. Otherwise, there would be frequent instances of multiple pitch triggers on strong or stressed sounds.
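The combined effect of the threshold limiting circuit and the latch can be modeled in a few lines of Python. This is a hedged behavioral sketch, not a description of circuits 15 and 18: it assumes that negative peaks are supplied as (time, magnitude) pairs and that the microcomputer clears the latch a fixed hold-off time after each accepted trigger. The hold-off policy and all names are assumptions made for illustration.

```python
def select_pitch_triggers(negative_peaks, threshold, holdoff_us):
    """Pick pitch trigger times from (time_us, magnitude) negative peaks.

    A peak is passed only if its magnitude reaches the threshold, and it is
    "looked at" only if the latch has already been cleared. The latch is
    assumed to be cleared holdoff_us after each accepted trigger.
    """
    selected = []
    latch_clear_time = None  # time at which the latch becomes clear again
    for time_us, magnitude in negative_peaks:
        if magnitude < threshold:
            continue  # rejected by the threshold limiting stage
        if latch_clear_time is not None and time_us < latch_clear_time:
            continue  # latch still set: this trigger is not considered
        selected.append(time_us)
        latch_clear_time = time_us + holdoff_us
    return selected
```

Under this model, a second strong peak shortly after an accepted one is ignored, which corresponds to avoiding multiple pitch triggers on a strong or stressed sound.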

The selected pitch trigger signal serves as a pointer to locate the first negative pressure wave related to the onset of each glottal pulse of the speaker, in order to allow locating and measuring the durations of the following first two positive pressure waves, which, in accordance with my recent discoveries, contain a major portion of the information that allows accurate determination of the identity of the phoneme or characteristic speech sound presently being input to microphone 2.
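As an illustration of how the pitch trigger can act as a pointer, the Python sketch below measures the first two "1" durations that begin at or after a trigger time. It assumes that the binary waveform is available as a time-ordered list of (time, level) transitions; the function name and data layout are hypothetical and are not taken from the programs of FIGS. 13A to 16.

```python
def first_two_positive_durations(transitions, trigger_time_us):
    """Return the durations of the first two "1" levels after the trigger.

    transitions: time-ordered (time_us, level) pairs, where level 1 marks a
    rising edge and level 0 marks a falling edge of the binary waveform.
    Returns a (first, second) tuple, or None if fewer than two complete
    "1" levels follow the trigger.
    """
    durations = []
    rise_time = None
    for time_us, level in transitions:
        if time_us < trigger_time_us:
            continue  # ignore everything before the pitch trigger
        if level == 1:
            rise_time = time_us
        elif rise_time is not None:
            durations.append(time_us - rise_time)
            rise_time = None
            if len(durations) == 2:
                return tuple(durations)
    return None
```

The returned pair is the kind of two-component input vector that is compared with stored vector domains.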

The input capture register 11 of microcomputer 10 operates in conjunction with a software subroutine that detects the occurrence of any negative-going or positive-going transition of the binary waveform on conductor 9 and stores the time of that occurrence in a 16 bit software capture register. This provides accuracy of "capture" of each transition of the binary speech waveform on conductor 9 to within one microsecond. The disclosed arrangement allows microcomputer 10 to capture, in real time and with very high accuracy, each major inflection point (as defined above) of analog speech waveforms on conductor 7. The binary waveform on conductor 9 is, in effect, a "piece-wise linear" approximation of the analog signal on conductor 7.
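Because the captured times are 16 bit values, a duration computed from two captures has to allow for the counter wrapping around. The Python sketch below shows the usual modular subtraction. It assumes a free-running counter with one tick per microsecond (suggested by the stated accuracy, but not stated outright) and intervals shorter than 65,536 ticks; the names are hypothetical.

```python
TIMER_BITS = 16
TIMER_MASK = (1 << TIMER_BITS) - 1  # 0xFFFF


def elapsed_ticks(earlier, later):
    """Ticks between two 16 bit captures, correct across one wraparound."""
    return (later - earlier) & TIMER_MASK


def durations_from_captures(captures):
    """Convert successive capture values into level durations in ticks."""
    return [elapsed_ticks(a, b) for a, b in zip(captures, captures[1:])]
```

For example, a capture of 65000 followed by a capture of 200 corresponds to 736 ticks, not a negative interval.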

In accordance with the teachings of my prior U.S. Pat. No. 4,284,846, the characteristic ratios or vectors between such major inflection points can be compared to stored phonemic vectors by means of the decision tree shown in FIG. 3 (which is nearly identical to FIG. 8 in my above mentioned U.S. Pat. No. 4,284,846) to rapidly identify, in real time, the phoneme presently being uttered by a speaker into microphone 2 or the characteristic sound presently being received by microphone 2.

In accordance with my more recent discoveries, a "vector map" representation of the first two positive pressure wave time durations, such as the vector maps shown in FIGS. 17 and 18, provides a more accurate decision tree than can be achieved using calculated ratios, which are scalar quantities, compared to vectored quantities which include scalar amplitudes and directions.
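The advantage of a vector map over a ratio can be seen in a small Python sketch. The domains below are rectangular and their labels and bounds are placeholders invented for illustration; they are not the domains of FIGS. 17 and 18, whose shapes and values are not given in this section.

```python
def classify_vector(vector, domains):
    """Return the label of the first domain containing the vector, else None.

    domains maps a label to (x_min, x_max, y_min, y_max), where x is the
    first "1" duration and y is the second "1" duration.
    """
    x, y = vector
    for label, (x_min, x_max, y_min, y_max) in domains.items():
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return label
    return None


# Placeholder domains (hypothetical values, in microseconds).
EXAMPLE_DOMAINS = {
    "domain_A": (150, 250, 50, 150),
    "domain_B": (350, 450, 150, 250),
}
```

The vectors (200, 100) and (400, 200) have the same ratio of 2.0, so a ratio alone cannot separate them, yet they fall in different domains of the map.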

Referring next to FIG. 4, a detailed circuit schematic diagram of a well known microphone amplifier circuit 3 is shown. It includes an inexpensive electret microphone cell 2 coupled to ground conductor 23 and also coupled by a bias resistor 31 to a five volt supply conductor 32. The output of microphone cell 2, which is the same as microphone 2 in FIG. 1, is connected to one terminal of a 4.7 microfarad capacitor 34, the other terminal of which is connected by means of a 2.2 kilohm resistor 35 to the negative input of an operational amplifier 37, which can be a National Semiconductor LM324. The positive input of operational amplifier 37 is connected to conductor 39, to which a bias voltage of approximately two volts is applied. Conductor 36 is connected to the negative input of operational amplifier 37, and is coupled by a 100 kilohm feedback resistor 38 to the output of operational amplifier 37, which is also connected to conductor 4, also shown in FIG. 1.

The electret microphone 2 has a flat pass band from under 100 hertz up to over 15 kilohertz. Capacitor 34 provides audio frequency coupling that blocks off the bias voltage from the input of operational amplifier 37. The gain of this amplifier circuitry is, of course, determined by the ratio of the resistance of resistor 38 to the resistance of resistor 35. However, care has to be taken to ensure that even slight coupling from the digital signal lines associated with microcomputer 10 onto conductors 36 and 39 as a result of the physical layout configuration of microphone amplifier circuit 3 is avoided.
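The gain implied by the stated component values can be worked out directly. The Python sketch below treats the stage as an ideal inverting amplifier at frequencies where capacitor 34 acts as a short circuit, which is an idealization; the function and constant names are hypothetical.

```python
import math

R_FEEDBACK_OHMS = 100_000  # resistor 38
R_INPUT_OHMS = 2_200       # resistor 35


def inverting_gain(r_feedback, r_input):
    """Mid-band voltage gain of an ideal inverting amplifier stage."""
    return -r_feedback / r_input


gain = inverting_gain(R_FEEDBACK_OHMS, R_INPUT_OHMS)
gain_db = 20 * math.log10(abs(gain))
```

With these values the magnitude of the gain is approximately 45, or roughly 33 decibels.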

Referring now to FIG. 5, a well known audio band pass amplifier circuit5 has its input connected to conductor 4. Conductor 4 is coupled by 18kilohm resistor 41 to conductor 43. Conductor 43 is coupled by 5.6kilohm resistor 45 to ground, and is also coupled by 0.01 microfaradcapacitor 42 to conductor 48, and by 0.01 microfarad capacitor 44 toconductor 47. A 240 kilohm feedback resistor 46 is coupled betweenconductors 47 and 48. Conductor 48 is connected to the negative input ofoperational amplifier 50, which can be a National LM324 operationalamplifier. Its output is connected to conductor 47 and its positiveinput is connected to conductor 49, to which a bias voltage ofapproximately two volts is applied.

The purpose of this circuit is to eliminate very high frequency, rapidlyfluctuating signals that could create pulses that are so fast, and areof such short duration (below approximately 50 microseconds) thatmicrocomputer 10 cannot reliably interpret them. On the other hand, aswill be discussed, the circuit of FIG. 5 must pass an adequate number ofpulses representing fricative phonemes.

An adequate range of frequencies for fricatives and voiced sounds has a center frequency of approximately 560 hertz and a pass band going from about 200 hertz to approximately 2300 hertz. The above indicated component values produce this pass band and provide a gain of approximately 10.

Fricative sounds typically have frequencies of approximately 4,000 hertz to 6,000 hertz. Some fricative high frequency components do get through the simple band pass filter 5, and these high frequency components are necessary to provide all of the "clues" needed to adequately distinguish various fricative sounds.

The audio band pass amplifier 5, with the above-indicated component values, provides an output range of signal amplitudes from approximately 60 millivolts peak-to-peak to approximately 3.3 volts peak-to-peak as inputs to the major inflection point detector circuit 8. In response to this input, inflection point detector circuit 8 produces a constant amplitude digital signal which can represent the wide "dynamic range" of spoken sounds.

This conversion of analog speech signals of widely varying amplitudes to a constant amplitude binary signal overcomes one of the severe shortcomings of many of the prior voice recognition approaches reported in the literature, namely the problem of dealing with signals having a very wide range of amplitudes with an adequate level of accuracy.

In contrast, my circuitry is very insensitive to wide variations of the amplitude of the analog speech signals. This is contrary to the majority of the teachings in the literature and prior art pertaining to speech recognition, which hold that the primary technique for demarcation between phonetic elements is sensing or observing the amplitude envelope of the speech waveform.

In further contrast, I have found that by using the time duration of the positive and negative slopes of the speech waveform, the rate of change or acceleration of these durations can be analyzed to provide better demarcation between phonemic elements of speech, and presumably other highly characteristic features of other sound waveforms. In addition, I have found that changes in the pitch due to stress accenting of speech also aid in the demarcation of connected phonetic elements.

The above described approach allows the production of digital signals that are easily handled and analyzed by microcomputer 10, thereby avoiding the difficulties in accurate analysis of analog signals having rapidly varying amplitudes.

Perhaps it will be best to describe the pitch band pass amplifier circuit 6 in some detail before discussing the inflection point detection circuit 8.

Referring next to FIG. 6, the pitch band pass amplifier circuit 6 has its input connected to conductor 4, which is coupled by 7.5 kilohm resistor 142 to conductor 143. Conductor 143 is coupled by 15 kilohm resistor 52 to ground conductor 23, by 0.047 microfarad capacitor 145 to conductor 146, and by 0.047 microfarad capacitor 149 to conductor 12. Conductor 146 is coupled to the negative input of an operational amplifier circuit 147, which can be a National Semiconductor LM324 operational amplifier, and by 68 kilohm feedback resistor 53 to conductor 12, which is also connected to the output of operational amplifier 147. The positive input of operational amplifier 147 is connected to conductor 148, to which a bias voltage of approximately two volts is applied. This circuit, with the above-indicated component values, has a center frequency of approximately 190 hertz, a pass band of approximately 100 hertz to approximately 360 hertz, and a gain of approximately four. This circuit is a multiple feedback band pass filter, which is well known. Its pass band covers the normal frequency range or pitch range of the glottal pulses uttered by most humans. The gain of filter amplifier 6 is only four because it is desired to detect only relatively large amplitude negative pressure wave peaks in the analog voice waveform produced by amplifier 13. This "pitch band pass amplifier" 6 is designed to be very selective, so as to cancel out the higher frequency components and sharp background noise as much as possible, and to provide amplification only in the band in which human glottal pulses normally occur.
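
The center frequency and gain quoted above can be cross-checked with the textbook design equations for a multiple feedback band pass stage having two equal capacitors. The sketch below is illustrative only: it assumes an ideal operational amplifier, and the function and parameter names are not taken from the original disclosure. Resistor 142 is treated as the input resistor, resistor 52 as the shunt resistor to ground, and resistor 53 as the feedback resistor.

```python
import math


def mfb_bandpass(r_in, r_shunt, r_fb, c):
    """Ideal multiple feedback band pass stage with two equal capacitors c.

    Returns (center frequency in hertz, gain magnitude at center, Q).
    """
    r_parallel = r_in * r_shunt / (r_in + r_shunt)
    f0 = 1.0 / (2.0 * math.pi * c * math.sqrt(r_parallel * r_fb))
    gain = r_fb / (2.0 * r_in)
    q = math.pi * f0 * c * r_fb
    return f0, gain, q


# Component values from the description of FIG. 6.
f0, gain, q = mfb_bandpass(r_in=7.5e3, r_shunt=15e3, r_fb=68e3, c=0.047e-6)
print(f"f0 = {f0:.0f} Hz, gain = {gain:.2f}, Q = {q:.2f}")
```

With the stated values the ideal formulas give a center frequency near 184 hertz and a gain near 4.5, in reasonable agreement with the approximate figures of 190 hertz and four given above.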

Referring next to FIG. 7, the major peak detector circuit 13 detects major negative peaks in the complex audio waveform that is produced on conductor 4 by amplifier 3, and then is filtered by pitch filter 6 to produce the signal on conductor 12. Peak detector circuit 13 has its input connected to conductor 12, which is coupled by a 2 kilohm resistor 54 to one terminal of 0.01 microfarad capacitor 55, the other terminal of which is connected to conductor 56. Conductor 56 is connected to the negative input of operational amplifier 60, which can be a National Semiconductor LM324, and is also coupled by 240 kilohm resistor 57 to conductor 14. Conductor 14 is connected to the output of operational amplifier 60. The positive input of operational amplifier 60 is connected to ground conductor 23.

The operation of peak detecting circuit 13 can be best understood with reference to the waveforms of FIG. 8. In FIG. 8, waveform 61 designates a signal resulting from glottal vibrations of the vocal cords that produce the signal on conductor 12, which we may call the analog pitch signal. Waveform 62 is the output of peak detecting circuit 13 on conductor 14. The above circuit detects the most negative excursion of the analog pitch signal 61, namely negative excursion 64. The input signal 61 rides on a two volt DC bias voltage.

As described elsewhere herein, each glottal pulse initiates a reverberatory sequence of negative and positive peaks whose positions and excursions depend on the cavity resonances of the speaker's articulation apparatus at the present moment. These excursions decay as the sound energy is dissipated, until another glottal pulse occurs to re-energize the speaker's resonant cavities. Circuit 13 has a shortcoming in that it will detect moderate amplitude negative excursions in addition to larger negative excursions of the waveform 61. This results in output pulses on conductor 14 corresponding to waveform 62 in FIG. 8.

More specifically, peak detector circuit 13 produces pulse 66 in response to the negative excursion indicated by reference numeral 64 and produces pulse 67 in response to the negative excursion 65 of input signal 61. My above-described circuitry can detect only the maximum amplitude negative excursion of each pitch cycle for a four to one range of pitches.

Depending on how closely the pitch cycle is harmonically related to the present resonant frequency of the voice cavities of the speaker, the second negative excursion, such as 65 in FIG. 8, can be almost as great in amplitude as the first negative excursion 64. Accordingly, it was necessary to provide the threshold limiting circuit 15, and to provide computer-controlled adjustment of the threshold of circuit 15.

It should be borne in mind that a reason for detecting the most negative excursion of each pitch cycle of the speech waveform is to reliably identify the location of the first positive pressure wave producing the speech waveform. I have found that the start of that pressure wave and the termination of the following positive pressure wave delineate a pitch synchronized "window" which contains enough information to allow accurate identification of the phoneme represented by the present portion of the speech waveform, provided, however, that circuit corrections are made for "discrepancies" between the present pitch of the speaker's voice and the resonant frequency of the present configuration of the speaker's mouth, i.e., articulation apparatus. The information outside of this pitch synchronized window is quite speaker-dependent and pitch-frequency dependent, and therefore is not used to identify phonemes.

Many previous researchers have performed the filtering of the speech waveform with large memory, high speed, very costly computers using Fourier transforms. Many researchers have attempted to locate the peak pulses by looking at the digital numbers produced by analog to digital converters receiving an analog speech waveform. Others have attempted a purely analog approach to pitch detection. The use of dedicated analog circuits to detect the major pitch pulses in cooperation with a microcomputer algorithm as described here appears to be quite cost effective.

Next, the slope detector or major inflection point detector circuit 8 of FIG. 1 is shown in detail. This circuit is shown in FIG. 9, wherein the filtered audio signal on conductor 7 is applied to the negative input of a high sensitivity comparator 71, which can be a Motorola MC3302 comparator. Conductor 7 is also coupled by 3.3 kilohm resistor 69 to conductor 70, which is connected to the positive input of comparator 71, and is also connected to one terminal of a capacitor 72. The other terminal of capacitor 72 is connected to ground conductor 23. Conductor 70 is also coupled by 220 kilohm resistor 73 to the tap of a potentiometer 74, which is connected between +5 volt conductor 32 and ground.

The output of comparator 71 is connected to conductor 9 and is also coupled by 22 kilohm pull-up resistor 75 to the +5 volt conductor 32. The binary output signal 44C shown in FIG. 2 is produced on conductor 9. (Other examples of the binary output signal on conductor 9 are shown in FIGS. 10A and 10B, subsequently described.)

A very important aspect of the improvement of the present invention is the providing of a band-limited waveform such as the one produced on conductor 7, and, by means of a very simple circuit such as 8, producing a "piece-wise linear" binary representation of this band-limited waveform which is invariant with respect to the amplitude of the audio signal and contains enough information to make phoneme recognition, especially speaker-independent phoneme recognition, possible.

Note that the strong pitch signal produced in response to the glottal pulses of the speaker has been attenuated on the low end of the frequency range, and a great deal of the very high frequency signal content, such as high frequency fricatives and background noise of the type commonly produced by fans and air conditioners, has been eliminated from the waveform on conductor 7. The filtered waveform on conductor 7 then has a bare minimum of features which nevertheless carry enough information to accurately identify phonemes and other characteristic sounds represented by the speech waveform produced by the microphone 2.

A major challenge in providing the improvements of the present invention was to determine how much detail could be eliminated from the analog speech waveform while leaving enough detail that a relatively simple circuit could extract that detail and make it available in binary form for processing by a microcomputer, especially a single chip microcomputer, in such a way as to accurately detect, independently of the unique characteristics of a particular speaker, phonemes and characteristic sounds that make up speech and produce reliable identifying signals that allow the speech to be accurately represented by phonemic signals.

Most of the slope detector circuits reported in the prior art count "zero crossings" per unit of time in the speech waveforms. This, of course, causes important features in a speech waveform that do not cross the zero or average level of the waveform to be lost. My results indicate that such lost features include some of the most important clues that are needed to detect and identify certain phonemic characteristics.

The circuit shown in FIG. 9, with no series resistor between conductor 7 and sampling resistor 69, results in maximum sensitivity and maximum dynamic range. Resistor 69 and capacitor 72 must have a time constant that will allow rapid response for the highest frequency that needs to be "observed" by the speech or sound analyzer circuit 1. During the time that the audio signal on conductor 7 goes in a positive direction, the current flowing through resistor 69 into capacitor 72 charges it in one direction, and causes a binary "1" to be produced at the output of comparator 71. When the next inflection point, such as 81-1, 81-2, 81-3, 81-4, etc. in waveform 77 in FIG. 10A occurs, there will be a reversal in the current flow direction, causing a change from a logical "1" to a logical "0", or vice versa, on conductor 9. Binary waveform 78 in FIG. 10A is the response of inflection point detector circuit 8 to the filtered analog speech waveform 77, and the narrow pulses 77-1, 77-2 of binary waveform 78 can readily be seen to occur at the times of the above-indicated inflection points caused by high frequency components contained in waveform 77.
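
The behavior described above can be illustrated in discrete time. The following minimal sketch is not a model of comparator 71 or of the resistor 69 and capacitor 72 network; it simply assumes a sampled waveform and emits a "1" while the signal is rising and a "0" otherwise, changing state at each inflection point. The function name, the optional `bias` parameter and the sample data are illustrative assumptions.

```python
def slope_to_binary(samples, bias=0.0):
    """Return 1 where a sample rises above its predecessor by more than bias, else 0."""
    bits = []
    for previous, current in zip(samples, samples[1:]):
        bits.append(1 if (current - previous) > bias else 0)
    return bits


# Hypothetical waveform: the small rise from 0.3 to 0.35 never crosses
# the average level, yet it still appears in the binary output.
wave = [0.0, 0.2, 0.5, 0.4, 0.3, 0.35, 0.6, 0.1]
bits = slope_to_binary(wave)

# Scaling the amplitude by ten leaves the binary representation unchanged.
quiet_bits = slope_to_binary([0.1 * x for x in wave])
print(bits, quiet_bits == bits)
```

The sketch shows two properties claimed in the text: features that a zero crossing counter would miss are retained, and the binary output does not depend on signal amplitude.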

In the publication "The Bionic Ear", Chapter 10, paragraph 10.7, by Stewart, there is shown a waveform quite similar to that shown in FIG. 10A of this application. The Stewart reference shows a procedure for converting speech to a highly "clipped" digital form. The textual description refers to it as "clipped" speech.

In the above Stewart reference, the objective is to find a way of simplifying speech for improved transmission in digital form over electrical lines, in essentially the same way that conventional delta modulator circuits attempt to improve digital transmission of speech over long lines and then reconstruct the original analog speech signal from the received digital signal. However, the Stewart reference does not suggest that the binary output produced by this circuit has the necessary information and accuracy to allow recognition of phonemes.

After many months of effort trying to simplify the "delta modulator" approach of my earlier U.S. Pat. No. 4,284,846 while still providing a binary output containing enough information to accurately detect phonemes, I independently arrived at a circuit quite similar to that disclosed in the Stewart reference, namely a circuit similar to that in FIG. 9 of the present application. One of the major problems of the delta modulator circuitry of my prior patent was that it produces a binary output during periods of silence. The "silence" portion of the binary waveform output by the delta modulator circuit is very difficult to deal with, given the objective of reducing the software burden on a single chip microcomputer.

After much experimentation and research, I eventually arrived at the conclusion that negative slopes, which I found to correspond to negative pressure waves of speech, contain far less useful phoneme-identifying information than the positive slopes, which correspond to positive pressure waves of speech.

I concluded that it would therefore be economical to cause the negative slopes to be represented by the same binary level, i.e., a "0", as the binary level used to represent periods of "silence", which also contain little or no phoneme-identifying information. If the bias level produced on conductor 70 in the circuit of FIG. 9 by the silence adjustment circuitry, including resistor 73 and potentiometer 74, is set properly at the level indicated by reference numeral 79-1, this objective can be achieved so that the inflection point detector circuit will produce the desired constant "0" level in response to silence or "white noise" consisting of very short "1" level pulses, as indicated by waveform 80, and also in response to very low-level fricative pulses in the "silence" waveform 79 of FIG. 10B. These fricative pulses include sharp, narrow peaks with long durations of silence between them.

The periods of "silence" between the fricative pulses shown in waveform 79 result in "0" levels in the binary output waveform produced on conductor 9 by inflection point detector circuit 8 of FIG. 9. The procedure for adjusting the desired "offset" is to apply a 200 millivolt, 1 kilohertz sine wave input to conductor 7, and adjust potentiometer 74 so that a binary output is produced on conductor 9. The resulting offset voltage on the positive input of comparator 71 compensates for input offset variations that will occur from unit to unit with the Motorola MC3302 circuit used to implement comparator 71.

Referring next to FIG. 11, the threshold limiting circuit 15 of FIG. 1 is shown. This circuit receives the peak pulses 62 (FIG. 8) on conductor 14 produced by peak detector circuit 13, and couples them by a 110 kilohm resistor 83 to conductor 84. Conductor 84 is connected to the positive input of operational amplifier 85, which can be a National Semiconductor LM324. Conductor 84 is also coupled by 510 kilohm resistor 86 to conductor 17, which supplies a pitch period signal to pulse shaper and latch circuitry 18 of FIG. 12. Conductor 17 is also connected to the output of operational amplifier 85. The negative input of operational amplifier 85 is connected to conductor 87, on which a threshold reference voltage having a "default" value of approximately 1.8 volts is produced by the resistive division of a +5 volt supply voltage on conductor 32 by resistors 88 and 89, which are connected in series between conductor 32 and ground conductor 23.

Conductor 87 is coupled by resistor 91 to threshold adjusting conductor 16, which is connected to a suitable output port of microcomputer 10 to provide computer controlled adjustment of the threshold of circuit 15. A capacitor 90 is also connected between conductor 87 and ground conductor 23 to provide a time constant of approximately 15 milliseconds with resistors 88, 89 and 91.

As mentioned above, the pitch trigger signal produced on conductor 14 by peak detector circuit 13 has amplitudes that vary considerably. Threshold limiting circuit 15 performs the function of selecting only the highest amplitude pulses of these peak signals in each period. In threshold limiting circuit 15, the ratio of resistors 88 and 89 establishes a reference level on the negative input of operational amplifier 85. This reference level is approximately 1.8 volts. Any voltage on conductor 14 that exceeds this 1.8 volt threshold level will cause a positive pulse to be produced on conductor 17.

For a normal amplitude voice uttered from several feet away from the microphone, the 1.8 volt reference level produces a single pitch pulse per pitch period of the speaker's voice on conductor 17, for most voiced sounds. However, if a speaker has an exceptionally weak voice, or lowers the volume at the end of a word, or is standing far enough away from the microphone, then few or no pulses will be produced on conductor 17 if the reference voltage remains equal to 1.8 volts. In this event, the microcomputer 10 produces a sequence of "0" level pulses on conductor 16, which effectively reduce the charge on capacitor 90 temporarily, and thus lower the threshold level at the negative input of operational amplifier 85 in order to pass the peaks of the smaller pulses on conductor 14.

Conversely, strongly voiced phonemes having several pulses per pitch cycle, or the speaker speaking too close to the microphone, result in generation of too many pitch pulses. In this event, microcomputer 10 produces a sequence of "1" level pulses on conductor 16 to increase the charge on capacitor 90 temporarily, in order to reduce the number of pulses passed on conductor 17.
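
The adjustment policy described in the preceding paragraphs can be sketched in software as follows. This is a minimal illustration of the decision logic only, not of the charge on capacitor 90 or of the actual microcomputer program. The 1.8 volt default comes from the text; the step size, the limits, the function names and the example peak amplitudes are hypothetical values chosen for illustration.

```python
DEFAULT_THRESHOLD_MV = 1800  # default reference level stated in the text


def count_pulses(peaks_mv, threshold_mv):
    """Count the peaks in one pitch period that exceed the threshold."""
    return sum(1 for peak in peaks_mv if peak > threshold_mv)


def adjust_threshold(threshold_mv, pulses, step_mv=100, lo_mv=200, hi_mv=4500):
    """Hypothetical rule: lower on no pulses, raise on more than one pulse."""
    if pulses == 0:
        return max(lo_mv, threshold_mv - step_mv)
    if pulses > 1:
        return min(hi_mv, threshold_mv + step_mv)
    return threshold_mv


def settle(peaks_mv, threshold_mv=DEFAULT_THRESHOLD_MV, max_periods=40):
    """Apply the rule over repeated identical periods until one pulse passes."""
    for _ in range(max_periods):
        pulses = count_pulses(peaks_mv, threshold_mv)
        if pulses == 1:
            break
        threshold_mv = adjust_threshold(threshold_mv, pulses)
    return threshold_mv


weak_voice = [1500, 900]      # hypothetical peak amplitudes, millivolts
close_talker = [3000, 2400]   # hypothetical peak amplitudes, millivolts
print(settle(weak_voice), settle(close_talker))
```

In this sketch the threshold drifts downward for the weak voice and upward for the close talker until exactly one pulse per pitch period is passed.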

The microcomputer 10 may determine that, if too many positive pulses are being produced on conductor 17, the sound being received is not a "voiced" sound, because no human vocal cords are capable of producing positive pressure waves with such rapidity. Other criteria involved in the subsequently described "autocorrelation" analysis also can be used by the microcomputer program to appropriately adjust the threshold voltage adjustment signal on conductor 87.

In order to cause the pitch pulses produced on conductor 17 to have the proper characteristics for processing by conventional digital logic circuitry, it is necessary to "shape" them. This is done by the pulse shaper and latch circuit 18 shown in FIG. 12. This circuit receives the selected pitch trigger signals on conductor 17 and couples them by means of 780 picofarad capacitor 93 to conductor 94. Conductor 94 is coupled by 24 kilohm resistor 96 to ground conductor 23 and by 3.6 kilohm resistor 95 to the base of NPN transistor 97. The emitter of NPN transistor 97 is connected to ground conductor 23 and its collector is connected to conductor 98. Conductor 98 is coupled by 10 kilohm resistor 99 to 5 volt conductor 32. Conductor 98 is coupled to the S* (set) input of an RS latch 92, the R* (reset) input of which is connected to conductor 20. As previously explained, conductor 20 is connected to an output port of microcomputer 10. The Q* output of latch 92 is connected to the IRQ (interrupt request) conductor of microcomputer 10.

The relatively wide pitch trigger pulse on conductor 17 may be as much as 300 microseconds wide. Relatively small capacitor 93 and relatively large resistor 96 differentiate the pitch trigger signal, thereby producing a much narrower pulse on conductor 94. By applying this narrow pulse to the base of NPN transistor 97, which functions as an inverter, a narrow (approximately 10 microseconds) pulse is applied to the S* input of latch 92. When the latch 92 is set, the Q* output goes to a low level, causing an interrupt request flag to be set inside microcomputer 10. After responding to and evaluating the interrupt request, microcomputer 10 can then reset latch 92 by means of conductor 20, often blanking out some of the spurious strong pulses, since the pitch duration has been established.

Microcomputer 10 of FIG. 1 can be implemented by a Motorola MC68701 microcomputer with 2048 bytes of programmable program memory and read only memory. Average instruction execution times of 2 to 4 microseconds are required. The microcomputer must have 16 bit arithmetic capability, and 128 bytes of scratch pad memory. The microcomputer must also be capable of measuring elapsed times between two input events as close together as 100 microseconds with a resolution of 2 microseconds, and must be capable of forming internal and external timed alarms of equivalent resolution.

The microcomputer must be capable of providing parallel output data or serial codes representing identified sounds, phonemes, and allophones to various devices, such as desk-top computers which receive the signals as command words, or to a phonetic typewriter, or to tactile matrix devices to stimulate the skin of deaf persons, or to a low bit communications modem, or to a "hands off" instrument control panel, etc. Other suitable microcomputers that could be used include the Hitachi HD6301, the Intel 8096, or the Motorola MC68HC11.

The software executed by microcomputer 10 includes a "foreground" analysis routine which gathers information and stores it in its internal random access memory. This software also includes a "background" analysis program that includes an algorithm which processes the gathered binary data produced by the "foreground" routine and determines how it should be processed and when to take appropriate action. The heart of the background program executed by microcomputer 10 is shown in the flow chart of FIGS. 13A and 13B. The data that is stored in appropriate locations of the random access memory includes binary data representing the real times of the transitions between levels produced on conductor 9 in FIG. 1. These "captured" transition times are used to compute time intervals in accordance with the foreground flow chart of FIG. 14.
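
One way the captured transition times could be turned into positive and negative slope durations is sketched below. The sketch assumes a 16 bit free-running counter with a 2 microsecond tick, chosen to match the resolution stated earlier; the counter width, the tick period, the function names and the sample capture values are assumptions for illustration and are not taken from the flow chart of FIG. 14.

```python
TIMER_MODULUS = 1 << 16  # assumed 16 bit free-running counter
TICK_US = 2              # assumed tick period, matching the 2 microsecond resolution


def elapsed_ticks(start, end):
    """Ticks between two captures, correct across one counter wraparound."""
    return (end - start) % TIMER_MODULUS


def intervals_us(capture_ticks):
    """Durations in microseconds between successive captured transitions."""
    pairs = zip(capture_ticks, capture_ticks[1:])
    return [elapsed_ticks(a, b) * TICK_US for a, b in pairs]


def split_p_n(durations_us, first_level=1):
    """Assign alternating durations to P ("1" level) and N ("0" level) lists."""
    pvals, nvals = [], []
    level = first_level
    for duration in durations_us:
        (pvals if level == 1 else nvals).append(duration)
        level ^= 1
    return pvals, nvals


# Hypothetical captures; the counter wraps between the second and third.
captures = [65400, 65500, 100, 350]
durations = intervals_us(captures)
pvals, nvals = split_p_n(durations, first_level=1)
print(durations, pvals, nvals)
```

The modulo subtraction keeps each interval correct as long as successive transitions are less than one full counter period apart.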

A "Glossary of Terms" is shown in Table 1, listing explanations of key terms used in the flow charts appended hereto.

                  TABLE 1
______________________________________________________________________
GLOSSARY OF TERMS
______________________________________________________________________
PVAL           The (time) duration of any positive-going slope in the
               audio waveform between peak points of inflection.

NVAL           The duration of any negative-going slope in the audio
               waveform between points of inflection, including any
               longer periods of zero slope as found in very weak
               fricatives or silent intervals.

PAVG or P*     A rolling average of successive PVAL time intervals
               using a number of stages of software filtering in order
               to indicate the trend of the P value.

NAVG or N*     A rolling average of successive NVAL values using a
               number of stages of software filtering in order to
               indicate the trend of the N value.

Q              The sum of any successive pair of PVAL and the following
               NVAL times, which determines the total period of this
               resonant cycle.

QAVG           The sum of NAVG and PAVG.

SILENCE        Any N greater than some constant of time, e.g.,
               5 milliseconds.

VALID P        The first P greater than some constant of time, e.g.,
               200 microseconds, which follows a SILENCE interval.

FRICATIVE      A region of the audio waveform characterized by short P
               durations, e.g., under 400 microseconds for PAVG,
               interspersed with various random times of N.

VOICED         A region of the audio waveform characterized by various
               durations of orderly Ps of moderate length, e.g., over
               450 microseconds, interspersed with various durations of
               orderly Ns of similar durations. These orderly sequences
               repeat in a cyclic fashion every 1, 2, 3, 4, . . . sets
               of Q.

DELTA P,       The time differences between the P and N durations
DELTA N        stored in sections of the rolling average software
               filters, respectively.

SWOOP          DELTA P and/or DELTA N have exceeded prescribed values,
               indicating that the speaker's articulating apparatus is
               moving to another position.

TRAJECTORY     The direction of motion in a swoop, calculated by using
               the magnitude and sign of both DELTA values.

PITCH CYCLE    The time required for a cycle pattern of Q values to
               repeat.

PITCH PULSE    A pointer indicating the region in each PITCH CYCLE
               where a new burst of energy from the vocal cords has
               arrived to initiate the next cycle of cavity resonances.

P1, N1, P2     The three significant segments following the PITCH
               PULSE. These pressure wave segments are least influenced
               by the "personality" of the speaker's overall resonant
               cavity details.

PITCH TIMEOUT  The predicted point in time where the next trigger
               should be. It is used to control the pitch trigger
               threshold activity and to reassess the classification of
               the sound (silence, fricative, mixed, voiced) and the
               detailed identity of the phoneme.

DWELL          A long period of time where DELTA PAVG is below a
               prescribed rate of change, indicating that the speaker
               is attempting to sustain a particular phoneme.
______________________________________________________________________
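
The glossary definitions can be exercised with a short sketch. The thresholds of 5 milliseconds, 400 microseconds and 450 microseconds are the example constants given in Table 1. Everything else is an assumption made for illustration: the single-stage smoothing stands in for the "number of stages of software filtering", the "mixed" label for PAVG values between the fricative and voiced limits is a guess at how that class is reached, and the function names are hypothetical.

```python
SILENCE_N_US = 5000       # example constant from Table 1
FRICATIVE_PAVG_US = 400   # example constant from Table 1
VOICED_PAVG_US = 450      # example constant from Table 1


def update_average(previous_avg, new_value, weight=4):
    """One assumed smoothing stage: move 1/weight of the way to the new value."""
    return previous_avg + (new_value - previous_avg) / weight


def q_value(pval_us, nval_us):
    """Q: a PVAL plus the following NVAL, the period of one resonant cycle."""
    return pval_us + nval_us


def classify(pavg_us, nval_us):
    """Coarse sound class from the glossary thresholds (assumed ordering)."""
    if nval_us > SILENCE_N_US:
        return "silence"
    if pavg_us < FRICATIVE_PAVG_US:
        return "fricative"
    if pavg_us > VOICED_PAVG_US:
        return "voiced"
    return "mixed"


print(classify(300, 6000), classify(300, 800), classify(700, 650))
```

A fuller implementation would also track DELTA P and DELTA N between filter stages to detect a SWOOP or a DWELL, which this sketch omits.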

In FIGS. 13A and 13B, the program for operating on such data gathered by the "foreground" routine of FIGS. 14, 15 and 16 is entered via label 100 after the foreground analysis is finished. The program of FIGS. 13A and 13B goes from label 100 to decision block 101, where the program waits for the "data ready" signal from the foreground program signifying that new data is in the RAM (random access memory). If the needed data is available in the RAM, the program rapidly determines if the present data indicates a present condition of "silence", or if a fricative sound is being received, or if a voiced sound (one produced by vocal cords) is being received. If it is determined in decision block 101 that no new data has arrived, the program moves back to label 100.

However, if new data has arrived, the program goes to decision block 102. In decision block 102, if a determination is made that there is presently the condition of silence, the N value will be too long, and the program recognizes that such long values of the N duration are not possible in human speech. More specifically, the program determines that if the N duration is more than a prescribed constant, such as 5 milliseconds, then the present condition is one of silence, and the program then enters block 105 and "accumulates" silence time by counting successive "time-out" alarms until a valid P interval indicative of fricative or voicing activity is detected.

If a determination is made in decision block 118 that the present silence time is "very long", the program goes to block 117 and sets a "long silence" condition, indicating, for example, that the speaker has left and that the equipment can be put on a standby condition until new data arrives. If the determination of decision block 118 is that the present silence condition is not one that can be categorized as a "very long silence", the program goes to decision block 120 and determines if the criteria for a "medium silence condition" are met by the time count that has been "accumulated" in block 105.

Decision block 120 causes the program to determine if the accumulated silence time corresponds to a normal pause time that occurs between words and phrases in ordinary human speech. If the determination of decision block 120 is affirmative, the program goes to decision block 119 and categorizes the silence as a "pause". For example, in a word processing application, a silence of a particular length could be interpreted as meaning that a carriage return operation should be effectuated, while a shorter pause can delineate word separation. After the categorization of block 119 is complete, the program returns to label 100 and waits for new data to arrive.

If the determination of decision block 120 is negative, the program goes to decision block 122 and determines if the accumulated silence time fits into the category of being a "short silence" of the type that can be extremely useful in phoneme recognition. These short silence durations are caused by lung "pressure buildups" that occur, for example, as a result of a glottal "catch" that precedes a plosive phoneme, such as a "p" sound. Glottal catches are usually associated with the change in the position of the articulators of the mouth, throat, and tongue mechanisms that precedes the resumption of "voiced" sounds as speech is continued. These short silence durations are typically in the range of 30 to 150 milliseconds. Durations of the "medium silence" times tested for in decision block 120 are typically large fractions of a second.
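By way of illustration only, the silence categorization of decision blocks 118, 120 and 122 might be sketched as follows. Only the 30 to 150 millisecond range for a "short silence" comes from the description above; the function name `classify_silence` and the other two thresholds are assumptions chosen for the example.

```python
# Illustrative sketch of the silence categories of blocks 118, 120 and 122.
# Only the 30-150 ms "short silence" range is taken from the description;
# the other thresholds are assumed values for demonstration.
SHORT_MIN_MS = 30
SHORT_MAX_MS = 150
PAUSE_MIN_MS = 300       # assumed: a "large fraction of a second"
VERY_LONG_MS = 5000      # assumed: the speaker has probably left


def classify_silence(accumulated_ms: float) -> str:
    """Map an accumulated silence time to one of the described categories."""
    if accumulated_ms >= VERY_LONG_MS:
        return "long silence"       # block 117: equipment may go to standby
    if accumulated_ms >= PAUSE_MIN_MS:
        return "pause"              # block 119: word or phrase boundary
    if SHORT_MIN_MS <= accumulated_ms <= SHORT_MAX_MS:
        return "short silence"      # block 122: e.g. glottal catch before "p"
    return "unclassified"


print(classify_silence(80))     # short silence
print(classify_silence(600))    # pause
```

The tests are ordered from longest to shortest, mirroring the order in which the flow chart visits blocks 118, 120 and 122.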

Note that the time which elapses between pitch periods, or even between positive pressure waves, is sufficiently great that microcomputer 10 usually has enough time to execute the entire flow chart of FIGS. 13A and 13B and FIGS. 14-16 between such pulses.

Prevailing speech recognition techniques in the art have tended to use fast analog-to-digital conversion followed by fast mathematical analysis, requiring large amounts of data storage for variables and very high instruction execution rates to accomplish phonemic analysis.

However, in my invention the unique balance of analog "hardware" with the amount of computer "software" and the speed with which it can be executed by a single chip microcomputer obviates the need for extremely fast microcomputer components.

Returning now to decision block 102, the determination of this decision block is negative if the newest "foreground" data parameters stored as variables in the random access memory contain positive pressure durations of significant persistence, for example, if PVAL is over 150 microseconds, or if PAVG is over 50 microseconds.

The program then goes to decision block 103. If the determination of decision block 103 is that the PAVG positive pressure wave average duration is too short for human voicing, then the determination of decision block 103 is affirmative. This means that the sound is a fricative, a tapping or clicking sound, or some other background sound. The program then goes to decision block 104 and determines if the sound is just beginning to produce a significant PAVG time duration. If this is the case, the program goes to block 106 and bypasses some of the processing steps that are performed on comparatively long voiced sounds, because it is now known that the present sounds are unvoiced, short duration pulses. By forming a fricative identification vector consisting of PAVG for one coordinate of a vector map and NAVG for the other coordinate of the vector map, the identity of fricatives such as SH, F, TH, and so forth can be determined from a stored matrix or look-up table of the type shown in FIG. 17 and described in detail in my prior U.S. Pat. No. 4,284,846. Fricative sounds have no "voiced" components. Therefore they are speaker-independent, i.e., pitch-independent. FIG. 17 shows the typical regions where various fricatives are located, as I have empirically determined.

FIG. 17 is a vector map of the fricative regions plotting N* (NAVG) versus P* (PAVG), and is referred to in block 123 in the program of FIG. 13A. Fricatives are composed of rather short duration positive pressure waves separated by relatively long durations of inactivity or gaps between the positive pressure waves. The boundaries delineated by the dotted lines in the fricative vector map of FIG. 17 show the approximate regions where the indicated fricative sounds, such as "H", "F", and the various other fricative symbols shown in FIG. 17 that are defined in my prior patent, which is incorporated by reference herein, are located. This fricative categorization process is carried out in block 123 of FIG. 13A. The program then goes to block 124 and assigns a predetermined code to the fricative that was identified by reference to the look-up map or table in block 123, and waits for the fricative to be completed. In decision block 125, the program determines if the present fricative is complete by waiting until either silence occurs (i.e., N exceeds a certain value) or a voiced sound occurs (i.e., PAVG exceeds a certain value). When the fricative has been completed and terminated, the program goes to block 126 and re-configures the appropriate "foreground" operations of FIGS. 14-16 to the more elaborate analysis required for voiced sounds, and then returns to label 100 of FIG. 13A.

Returning now to decision block 103, if the determination of decision block 103 is that PAVG is too long for the present sound to be a fricative, then this negative determination causes the program to go to block 107 in FIG. 13B. The present sound then is probably a voiced sound, or perhaps a musical sound. In FIG. 13B, newly received "foreground" data, in the form of durations of PVAL and NVAL between the various transitions of the binary waveform on conductor 9 in FIG. 1 and values of PAVG, NAVG, QAVG and the velocity DELTA P and the velocity DELTA N computed in accordance with FIGS. 14-16, are used to compute the pitch period in block 107 using an autocorrelation technique. The program takes data previously computed in accordance with FIGS. 14-16 from a queue and, using conventional autocorrelation techniques, determines if the previous pattern of received data closely matches the present pattern of received data, to thereby identify repetitive or periodic matching and thereby determine the pitch period. The program, using the presently computed pitch period, goes to block 108 and updates the continuing average pitch period. This value represents the present pitch of the voiced sound being spoken. The averaging is necessary to eliminate minor "wavering" in the pitch period.
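As a rough sketch of the kind of computation block 107 performs, the following example applies a conventional normalized autocorrelation to a queue of interval durations and reports the lag at which the pattern best repeats. The choice of input (a plain list of durations in microseconds), the function name `estimate_pitch_period` and the lag limit are assumptions; the description above does not specify these details.

```python
# Illustrative sketch: find the repetition lag in a queue of interval
# durations by normalized autocorrelation. The input format and lag limit
# are assumptions, not details taken from the description.
def estimate_pitch_period(durations, max_lag=None):
    """Return (lag, period) where lag is the number of intervals per pitch
    cycle and period is the sum of the most recent `lag` durations, or None
    if no periodicity is found."""
    n = len(durations)
    if n < 4:
        return None
    max_lag = max_lag or n // 2
    mean = sum(durations) / n
    centered = [d - mean for d in durations]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return None                  # constant input: nothing to correlate
    best_lag, best_score = None, 0.0
    for lag in range(1, max_lag + 1):
        # Correlate the sequence with a copy of itself shifted by `lag`.
        score = sum(centered[i] * centered[i + lag]
                    for i in range(n - lag)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        return None
    return best_lag, sum(durations[-best_lag:])


# Four intervals per cycle, repeated four times (durations in microseconds).
print(estimate_pitch_period([900, 300, 200, 100] * 4))   # (4, 1500)
```

The returned period would then feed the running average of block 108.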

The program then goes to decision block 110 and determines if there is one pulse per pitch period. If this determination is negative, the program goes to block 109 and causes microcomputer 10 to adjust the pitch pulse sensitivity by appropriately varying the threshold adjustment voltage on conductor 16 of FIG. 1, as well as by monitoring the clearing of the pulse latch using conductor 20 from the microcomputer 10. The program then returns to block 107.

This process repeats until the threshold adjustment value has been varied enough, combined with the blanking of spurious trigger pulses in the reset latch, that there is one pulse per pitch period, as determined in decision block 110. If the determination of block 110 is affirmative, the program goes to decision block 111. The P values and N values obtained in voiced sounds in the data received from the RAM can undergo relatively rapid variations, especially during transitions between phonemes. In block 111, the program determines if such a rapid transition is occurring, i.e., if a "vocal swoop" is occurring. If this determination is affirmative, it means that the speaker is rapidly moving his articulators, which make up the phoneme-producing structure of the speaker's speech apparatus.

The program then goes to block 127 and computes a "swoop trajectory". This can be graphically illustrated by considering a plot of the instantaneous P values versus instantaneous N values, and determining if the present rate of change of the P and N values falls closest to 0° or closest to one of the 30° multiples of the complete 360° of possible swoop directions. This "direction" is stored and provides a rapid indication of the "directions" in which the articulators of the speaker's mouth are moving, as indicated in block 128.
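A minimal sketch of the direction quantization of block 127 is given below. It assumes that N is plotted on the horizontal axis and P on the vertical axis, and that the rates of change DELTA P and DELTA N are available as signed numbers; the function name `swoop_direction` is hypothetical.

```python
import math


# Illustrative sketch: quantize the (DELTA N, DELTA P) velocity vector to the
# nearest 30-degree multiple. The axis assignment (N horizontal, P vertical)
# is an assumption made for this example.
def swoop_direction(delta_p, delta_n, step_degrees=30):
    """Return the swoop direction in degrees (0, 30, ..., 330), or None if
    neither value is changing."""
    if delta_p == 0 and delta_n == 0:
        return None
    angle = math.degrees(math.atan2(delta_p, delta_n)) % 360.0
    # Round to the nearest multiple; 360 wraps back to 0.
    return int(round(angle / step_degrees) * step_degrees) % 360


print(swoop_direction(5, 10))    # 30
print(swoop_direction(-10, 0))   # 270
```

Storing only one of twelve direction codes keeps the information compact enough to be written to random access memory for later use.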

These swoop trajectory directions are helpful in identifying voice plosives at the start or finish of a sequence of sounds. If the determination of decision block 111 is that a vocal swoop is not presently occurring, the program goes to block 112 and computes a pitch factor equal to the average pitch divided by a sum comprised of the running average PAVG plus the running average NAVG. This sum represents the average reverberation cycle time of the present phoneme in the speaker's speech apparatus.

The program then goes to block 113 and, using the time of occurrence of the pitch trigger pulse, locates a "tag" on the most significant positive wave front duration times, namely P1 and P2 of the present pitch cycle. These most significant segments occur after the PITCH PULSE pointer.

In accordance with an important aspect of the invention, these individual time values of P1 and P2 are adjusted by any remainder obtained in the division computed in block 112. Block 114 indicates the making of this correction.

This step is believed to be important in establishing a high degree of speaker-independence of the phoneme recognition method of the present invention. This can be done by using the remainder of the above division to access a stored look-up table from which empirically determined quantities can be obtained for addition to or multiplication by the P1 and P2 values. Should there be no remainder, then the reverberation sequence of the phoneme is harmonic with and exactly fits with the present pitch period, so no correction is required.
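The pitch factor division of block 112 and the remainder correction of block 114 can be sketched as below. The additive form of the correction, the bucket width used to index the table, and the table contents are all hypothetical; the description states only that empirically determined quantities are looked up using the remainder.

```python
# Illustrative sketch of blocks 112 and 114. BUCKET_US and the contents of
# the correction table are assumptions; real values would be empirical.
BUCKET_US = 250   # assumed width of one correction-table bucket, microseconds


def pitch_factor(avg_pitch_us, pavg_us, navg_us):
    """Divide the average pitch period by the average reverberation cycle
    (PAVG + NAVG) and return (quotient, remainder)."""
    cycle_us = pavg_us + navg_us
    if cycle_us <= 0:
        raise ValueError("PAVG + NAVG must be positive")
    return divmod(avg_pitch_us, cycle_us)


def adjust_p1_p2(p1_us, p2_us, remainder_us, table):
    """Apply an additive correction to P1 and P2 looked up by remainder.
    No remainder means the phoneme fits the pitch period exactly."""
    if remainder_us == 0:
        return p1_us, p2_us
    d1, d2 = table.get(remainder_us // BUCKET_US, (0, 0))
    return p1_us + d1, p2_us + d2


factor, remainder = pitch_factor(8500, 1500, 500)
print(factor, remainder)                                        # 4 500
print(adjust_p1_p2(400, 300, remainder, {2: (20, -10)}))        # (420, 290)
```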

The program then goes to block 115 and, guided by the decision tree of FIG. 3, categorizes the voiced phoneme on the basis of the best match between the voiced phoneme vector domain map of FIG. 18 and the vector formed between the "adjusted" values of P1 and P2, and also on the basis of the swoop trajectory direction of block 128, and obtains a code that represents the present voiced phoneme. (Note that in block 128, the setting of the swoop trajectory direction means writing this information into random access memory, thereby making this information readily available for the steps in block 115.)

The usefulness of the swoop trajectory lies in the fact that for certain phonemes, it is not possible to determine what the prior phoneme was until the next phoneme is being uttered. For short phonemes, the real-time phoneme recognition will occur very soon after the actual utterance thereof. For rapidly spoken short fricative sounds and staccato sounds, the phoneme identification will lag by approximately one phoneme behind, because for such sounds the "total evidence" produced by the various positive pressure wave transitions, including the swoop trajectories produced during the transitions between the prior phoneme and the following one, must be converted to binary data and analyzed before correct identification of the phoneme can take place. For long, slow vowels, the identification of phonemes will be made during the time that the vowel is being uttered. Such matters are handled by block 116.

The information available to the program for phoneme identifying decisions includes information as to the existence of silence durations of the various lengths and the time of occurrence of the beginning of every pitch cycle. The program, in block 116, analyzes and uses this information, and also information regarding the presence of a fricative, information indicating the identification of the fricative, information as to the swoop trajectory direction, the pitch factor, and the remainder correction factor, in order to determine the identity of the most recent phoneme from the voiced phoneme vector domain map of FIG. 18. If the same result is obtained in two consecutive pitch periods or default values of the pitch period during fricative and silence intervals, then the phoneme-identifying code will be output by microcomputer 10 to a suitable receiving device via bus 21.

This procedure is indicated in the output or executive decision routine that is entered from label 130. The step of waiting for the pitch time-out, which can be a real pitch time-out or a default value, is performed in block 131. If the same phoneme "candidate" is consecutively identified twice, the output decision is made in block 132, and the phoneme-identifying code is transmitted in accordance with block 133. A return to the background analysis occurs via return labels 134 and 100.
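The two-in-a-row confirmation of blocks 131 to 133 might be sketched as follows. The class name `OutputDecision` and the phoneme labels are hypothetical. The description does not state what happens when a candidate persists beyond the second period; this sketch simply emits the code again on each further match, which is an assumption.

```python
# Illustrative sketch of the output decision: a candidate code is emitted
# only when it matches the candidate of the preceding pitch period (or
# default period). Handling of longer repeats is an assumption.
class OutputDecision:
    def __init__(self):
        self._previous = None

    def update(self, candidate):
        """Call once per pitch time-out; return the code to transmit, or
        None if the candidate is not yet confirmed."""
        confirmed = candidate is not None and candidate == self._previous
        self._previous = candidate
        return candidate if confirmed else None


decision = OutputDecision()
outputs = [decision.update(c) for c in ["AH", "AH", "EE", "AH", "AH", "AH"]]
print(outputs)   # [None, 'AH', None, None, 'AH', 'AH']
```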

Referring to FIG. 18, I have empirically determined that generally increasing values of P1 indicate a moving of the tongue position from the front of the mouth to the back of the mouth, as indicated by arrow 279. Increasing values of P2 in FIG. 18 tend to indicate moving of the mouth from a relatively open or slack configuration to a closed position so that a partial fricative type of sound is produced. For example, vector 280 in FIG. 18 represents a fairly "loose" or "open" vowel. Vector 281 identifies a back vowel that is close to the point of being a fricative, for example, the Germanic back vowel in "hoch", the German word for "high". This vector map is the one which is referred to in block 115 of the flow chart of FIG. 13B.

In the process of locating the phoneme from the voiced decision map of FIG. 18 and/or the fricative decision map of FIG. 17, empirically determined vector region boundaries have been set up which must be compared with the input vector. Each examination is carried out by means of a series of boundary comparisons that involve execution of a sequence of microcomputer instructions. Two factors which are considered in this decision process are the basic category of the present sound, i.e., whether it is a fricative, voiced, or mixed sound, or silence, and the manner in which the look-up vector map should be searched.

The foregoing description of FIGS. 13A and 13B indicated how the basic category of sound is determined in accordance with the present invention.

Next, referring to FIG. 3, a phoneme map is shown where each branch radiating from the center of the map includes a sequence of phonemes based on their frequency of usage in average American or English speech.

The search of each branch can be done in a sequence based on this decreasing order of usage of phonemes, as represented graphically by the decreasing size of the rectangles. The probabilities, therefore, are in favor of the location of the present phoneme in the shortest average search time, usually within five matching tests. If no match is found in the course of searching from the center of the phoneme map to the end of the particular branch, the search is exited from that branch and returns to the center of the phoneme map. Then, based on broad evidence, the routine searches through the next most likely branch, etc.

The main phonemic branch "directions" radiating from the center of the phoneme map include:

East: Plosives

North: Fricatives

Northwest: Back vowels

West: Mid vowels

Southwest: Front vowels

South: Liquid, nasal sounds

These techniques for matching greatly reduce search time needed to cope with other languages. Certain phonemes can be added to and/or deleted from the phoneme decision tree shown in FIG. 3. Also, new sequences based on frequency of usage in another language would be required to obtain fastest average searching.

Next, the "data gathering" activities performed by the subroutines of FIGS. 14-16 will be described. These subroutines operate on the "captured" times of occurrence of every transition of the binary waveform on conductor 9 of FIG. 1, and compute updated values of P1, P2, N1, PAVG, NAVG, DELTA P, DELTA N, etc. These subroutines are referred to as "foreground" interrupt routines, and the data they produce is referred to as "foreground data", whereas the routines of FIGS. 13A and 13B are referred to as "background" routines.

Referring to FIG. 14, a foreground subroutine executed by microcomputer 10 for jumping to interrupt flags and measuring intervals based on the binary waveform on conductor 9 is shown. This foreground subroutine performs the data acquisition function. If a falling edge of the binary waveform is received, then the time of the interrupt is automatically recorded in the capture register 11 of microcomputer 10 (FIG. 1), indicating the "1" level is over, and the subroutine is entered via label 150.

In block 151, the microcomputer 10 clears a contingency software time-out alarm for long "0" durations and enters block 152. Block 152 of the program computes the duration of the just completed positive voice level or "P" value and assigns it to a variable called PVAL, measured in microseconds. The program then goes to decision block 153 and determines if there is an interrupt request signal on conductor 19 of FIG. 1. If the determination is affirmative, the program goes to block 154 and sets a variable called "PULSE" equal to a logical "1". If not, the routine goes to block 155 and sets the value of the variable "PULSE" equal to a logical "0", indicating that no pitch pulse has been recently received on conductor 19. In this case, the routine enters block 156 and computes the velocity and acceleration, and also the "trend" or running average, PAVG, of the instantaneous value PVAL.

All foreground parameters are made available to the background analysis routine of FIGS. 13A and 13B when the "data ready" flag is set.

The foreground routine of FIG. 14 then enters decision block 157 and determines if the voice level has already gone back to a high level. If it has, then it is known that the duration of the present "0" level is very short. This determination is made because it is known that the sequence of events from label 150 to the end of block 156 requires a certain amount of time, and ordinarily the following edge level should not have reversed direction yet. If the direction has reversed, this indicates that "splitting" of the negative-going waveform represented by the high level presently on the binary waveform conductor 9 has occurred. If splitting has occurred, the routine goes back to block 152 and repeats the foregoing sequence of steps, in order to make a complete analysis of the positive undulation that must have occurred in the negative slope of the speech waveform.

If the determination of decision block 157 is negative, the program enters block 158, which is a software "silence" time-out alarm (should the present "0" level be the start of silence) and then exits the binary waveform evaluation routine of FIG. 14. (The execution of this interruption path is designed to be as fast as possible in order to maximize the time spent in the background routine of FIGS. 13A and 13B.)

Alternatively, if a rising edge is received on the binary waveform conductor 9, label 160 of the flow chart of FIG. 14 is entered by interruption from the background routine of FIGS. 13A and 13B. This subroutine first goes to block 161 and computes the duration of the just completed "0" level and assigns it to a variable NVAL, measured in microseconds. The subroutine then goes to block 162 and computes the velocity, acceleration, and trend or average value of NVAL. The program then goes to block 163 and computes the present resonance cycle duration Q, which is equal to the sum of the instantaneous values of NVAL and PVAL.

The program then goes to decision block 164. If the binary level of conductor 9 is already low, then a minimum default time of PVAL is set, as indicated in block 165, and the routine goes back to block 152. If the voice level is not low, the routine is exited.
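The interval measurements of blocks 152, 161 and 163 can be illustrated in software as follows. In the described system the edge times come from the capture register under interrupt control; here they are assumed to be supplied as a time-ordered list of `(time_us, kind)` pairs, and the function name `measure_intervals` is hypothetical. Velocity, acceleration and running averages are omitted because their formulas are not given above.

```python
# Illustrative sketch: derive PVAL, NVAL and Q from captured edge times.
# The list-of-tuples input stands in for the interrupt-driven capture
# register and is an assumption made for this example.
def measure_intervals(edges):
    """edges: time-ordered (time_us, kind) pairs, kind "rise" or "fall".
    Returns three lists: PVAL values, NVAL values, and Q = NVAL + PVAL."""
    pvals, nvals, qvals = [], [], []
    last_time, last_kind = None, None
    for time_us, kind in edges:
        if kind == last_kind:
            continue                      # ignore a repeated edge of one kind
        if last_time is not None:
            duration = time_us - last_time
            if kind == "fall":            # the "1" level just ended (block 152)
                pvals.append(duration)
            else:                         # the "0" level just ended (block 161)
                nvals.append(duration)
                if pvals:                 # block 163: resonance cycle duration
                    qvals.append(duration + pvals[-1])
        last_time, last_kind = time_us, kind
    return pvals, nvals, qvals


edges = [(0, "rise"), (400, "fall"), (1000, "rise"), (1300, "fall"), (2100, "rise")]
print(measure_intervals(edges))   # ([400, 300], [600, 800], [1000, 1100])
```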

Referring now to FIG. 15, if a glottal pulse (pitch trigger) is received, the routine is entered via label 170. This routine then enters block 171 and sets a flag indicating that a pitch pulse has been detected. The routine then goes to decision block 172 to wait for a "go ahead" to reset the pulse latch 15 of FIG. 1 and exits.

Referring next to FIG. 16, if there is a long duration of a "0" binary level, the time-out alarm will occur, as indicated in block 175, because of too long a negative interval on the binary waveform on conductor 9 without the occurrence of a rising edge. If this occurs, the subroutine enters block 176 and forces NVAL to have a prescribed time-out value. The program then goes to block 177, where an artificial value of PVAL is forced in order to stabilize the trend of PAVG. Next, the routine enters block 178, which forces all computation involved with the P values, and then exits.

Appendix A attached hereto is a printout of a computer program represented by the flow chart of FIGS. 13A and 13B. Similarly, Appendix B is a printout of the program represented by FIGS. 14-16.

In accordance with another embodiment of the invention, FIG. 19 shows a voice command module 180 which includes some of the hardware of FIG. 1, including microphone 2, microphone signal amplifier 3, audio band pass filter 5, inflection point detector 8, and a microprocessor 10 as shown in FIG. 1, but does not include the "pitch extraction" circuitry including blocks 6, 13, 15, and 18 of FIG. 1. Voice command module 180 is coupled by means of a cable 182 to a typical computer, such as a desktop computer 183.

A prompting light 186 is used to guide the speaker. The microphone 181 picks up ambient noise as well as the speaker's voice, and the module, in general accordance with the background routine of FIG. 13A, evaluates the existing noise level to establish if an adequately low level of "silence" exists to merit making a voice record. When one half second of "silence" has been achieved, the prompting light 186 indicates to the speaker that a valid "listening" condition exists.

In this embodiment of the invention, the incoming binary waveform on conductor 9 is operated upon generally in accordance with the "foreground" subroutine previously described with reference to FIGS. 14-16. However, voice command module 180 makes no attempt to create or compare the phoneme vectors previously described. Instead, the "foreground" data, including the above-mentioned P* and N* averages, are computed and are output via conductors 21 in FIG. 1, which are the conductors of the cable 182 in FIG. 19. Microprocessor 10 in the voice command module 180 outputs samples of the P* and N* averages every 10 milliseconds.

The information required to utilize this "sampled" running average data is contained in a program stored on floppy disc 184 in FIG. 19, which is then inserted into computer 183, as indicated by arrow 185. This program operates on the sampled running average data to compare it with stored "reference" data of a similar kind previously stored on floppy disc 184 in response to words spoken by the speaker whose voice commands are to be recognized during a "training session." These reference words are digitized and sampled by the voice command module 180 and stored on floppy disc 184.

The manner in which the voice command system 179 of FIG. 19 operates can perhaps be best understood by reference to a particular example. Referring now to FIG. 20, two graphs 195 and 196 are shown, which respectively represent the durations, plotted against time, of the N* samples and the P* samples output on cable 182 by voice command module 180. More specifically, each ordinate of waveform 195 represents the present value of the sampled running average of the durations of the negative pressure wave portions of the incoming speech waveform. Similarly, each ordinate of waveform 196 represents the present value of the sampled running average of the durations of the positive pressure wave portions of the incoming speech waveform.

The data shown in waveforms 195 and 196 in FIG. 20 corresponds to stored data for a particular person's utterance of the word "sappy", which is characteristic of a word with a silent interval preceding a plosive.

The program stored on floppy disc 184 (FIG. 19) operates on the data of FIG. 20 to isolate and produce data corresponding to "significant events" that can be identified from the features of waveforms 195 and 196. More particularly, these significant events include the characteristics listed in Table 2 below:

                  TABLE 2
    ______________________________________
           State      Symbol
    ______________________________________
           RISE       R
           FALL       F
           SILENCE    S
           DWELL      D
           CREEP      C
    ______________________________________

As indicated in Table 2, R designates a rapidly rising portion of the P* waveform 196. Similarly, rapidly falling portions of the P* graph are identified by F. S designates periods of silence.

Similarly, D and C designate portions of the P* graph which are relatively constant, and which change only very gradually, respectively.

The floppy disc 184 contains a simple routine which identifies the R states in accordance with the following formula:

    [(P*_n - P*_{n-1}) + (P*_{n+1} - P*_n)]

greater than 96 microseconds, where n is the sample number.

Similarly, the state identification subroutine produces an F state in accordance with the expression

    [(P*_n - P*_{n-1}) + (P*_{n+1} - P*_n)]

less than -96 microseconds.

The state identification subroutine identifies the S state in accordance with the expression

    N*>1700 microseconds.

The D state is identified in accordance with the condition that the magnitude of P*_n - P*_{n-1} is less than or equal to 48 microseconds.

Finally, the C state is identified by determining whether P* changes slowly by at least 64 microseconds during a D state.
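The state tests above lend themselves to a short sketch operating on the 10 millisecond samples of P* and N*. Several points are assumptions rather than statements of the described routine: the F test is taken to be the mirror image of the R test (a sum below -96 microseconds), silence is given precedence over the other states, a sample that satisfies no test is left unlabeled, and a run of D samples is relabeled C when P* drifts by at least 64 microseconds from the start of the run to its end. The function names are hypothetical.

```python
# Illustrative sketch of the R, F, S, D and C state tests. Thresholds are in
# microseconds and come from the text; precedence and the handling of C are
# assumptions made for this example.
RISE_FALL_US = 96
SILENCE_US = 1700
DWELL_US = 48
CREEP_US = 64


def classify_sample(p_avg, n_avg, i):
    """Label sample i as "S", "R", "F", "D" or None (no test satisfied)."""
    if n_avg[i] > SILENCE_US:
        return "S"
    if 0 < i < len(p_avg) - 1:
        slope = (p_avg[i] - p_avg[i - 1]) + (p_avg[i + 1] - p_avg[i])
        if slope > RISE_FALL_US:
            return "R"
        if slope < -RISE_FALL_US:      # assumed mirror image of the R test
            return "F"
    if i > 0 and abs(p_avg[i] - p_avg[i - 1]) <= DWELL_US:
        return "D"
    return None


def label_states(p_avg, n_avg):
    """Label every sample, then turn slowly drifting D runs into C runs."""
    labels = [classify_sample(p_avg, n_avg, i) for i in range(len(p_avg))]
    i = 0
    while i < len(labels):
        if labels[i] != "D":
            i += 1
            continue
        j = i
        while j + 1 < len(labels) and labels[j + 1] == "D":
            j += 1
        if abs(p_avg[j] - p_avg[i]) >= CREEP_US:
            labels[i:j + 1] = ["C"] * (j - i + 1)
        i = j + 1
    return labels


print(label_states([500, 500, 600, 700, 700, 700], [200] * 6))
# [None, 'R', 'R', 'R', 'D', 'D']
```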

Occurrence of any of the above events R, F, S, D, and C is determined, and the duration of each state is measured and stored in the memory of computer 183 and/or on the floppy disc 184. Occurrence of any of the above states is treated as an "event", and corresponding "event data" is gathered describing that event.

In FIG. 21, there are shown three bytes identified as bytes 1, 2, and 3, respectively, that represent the event data corresponding to one event. Byte 1 includes at least three significant bits that identify which of the five states R, F, S, D, and C have occurred. Bits 3-7 of byte 1 represent the duration of the most recent event. Byte 2 includes the value of the running average P* at the end of the present event. Byte 3 represents the present value of the running average N* at the end of the present event. The "duration" of the present state is the number of samples, each spaced 10 milliseconds from the last, that the present state lasted.

It can be seen that the foregoing approach of producing event data and specifying, in each three byte group of event data, the duration of that state obviates the need to produce data representing all of the points of time during which the corresponding state persists. This greatly reduces the amount of data which must be processed and stored, and greatly decreases the amount of time required for a processor to perform subsequent operations on the data, such as matching it with previously stored reference data, reconstructing the analog signal, or providing "narrow bandwidth" transmission of the speech data.
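A sketch of the three byte event format of FIG. 21 is shown below. The text fixes only that the state occupies three bits of byte 1, that bits 3-7 of byte 1 hold the duration, and that bytes 2 and 3 hold P* and N*. The numeric state codes, the convention that bit 0 is the least significant bit, and the assumption that P* and N* have already been scaled to fit in one byte are all hypothetical.

```python
# Illustrative sketch of packing one event into three bytes. The state codes
# and the bit-0-is-LSB convention are assumptions; P* and N* are assumed to
# be pre-scaled to the range 0-255.
STATE_CODES = {"R": 0, "F": 1, "S": 2, "D": 3, "C": 4}
CODE_TO_STATE = {code: state for state, code in STATE_CODES.items()}


def pack_event(state, duration_samples, p_avg, n_avg):
    """Return the three-byte record for one event."""
    if not 0 <= duration_samples <= 31:
        raise ValueError("duration must fit in the five bits 3-7")
    if not (0 <= p_avg <= 255 and 0 <= n_avg <= 255):
        raise ValueError("P* and N* must each fit in one byte")
    byte1 = STATE_CODES[state] | (duration_samples << 3)
    return bytes([byte1, p_avg, n_avg])


def unpack_event(record):
    """Inverse of pack_event: return (state, duration, P*, N*)."""
    byte1, p_avg, n_avg = record
    return CODE_TO_STATE[byte1 & 0x07], byte1 >> 3, p_avg, n_avg


record = pack_event("S", 12, 40, 200)
print(list(record))            # [98, 40, 200]
print(unpack_event(record))    # ('S', 12, 40, 200)
```

With five duration bits, one record covers at most 31 samples, or 310 milliseconds at the 10 millisecond sample spacing.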

In accordance with one embodiment of the present invention, an important application of the foregoing technique is the accomplishment of "speaker-dependent" voice command recognition, which is achieved by the voice command system 179 of FIG. 19, wherein the event data is compared with previously stored reference data that is obtained by "training" the system, a term understood in the art to mean entry of data corresponding to particular words spoken by a particular person into the system, which words are to be later recognized as commands by the system when spoken by the same person.

The graph in FIG. 22 illustrates how incoming event data 188 is compared to reference event data 187 previously stored during the "training" of voice command recognition system 179. Once the spoken commands are recognized, of course, computer 183 operates in response to the commands in precisely the same way as if the commands were to be entered by means of a keyboard.

Still referring to FIG. 22, the previously stored "reference events" 187 are plotted on the vertical axis, and are identified by event numbers, each event number 1, 2, . . . 24, corresponding to a "significant event" (i.e., R, F, S, D, or C) of the voiced command "learned" by the system 179 during the "training" session. During that "training" session, the commands are spoken into the microphone 181 of the voice command module 180, resulting in "raw" sampled reference event data produced in the format of FIG. 21 and stored in a predetermined number of preselected locations on floppy disc 184.

In the course of recognizing a real-time voice command later spoken into microphone 181, "unknown" utterance events of the present real-time command being spoken are assigned similar event numbers, which event numbers 1, 2, . . . 20, are generally indicated by reference numeral 188 in FIG. 22.

In FIG. 22, reference numeral 189 identifies a "window" including a group of four significant events of the present spoken word to be recognized. The event data for each of the significant events in window 189 is compared to the previously stored reference event data for stored reference event No. 1 until a matching occurs. The comparison program of FIG. 23, subsequently explained, is stored on floppy disc 184 and is executed by computer 183 to determine if any matching occurs. If a match does occur, the comparison program moves to a new "window" that begins immediately after the matching occurs. If no matching occurs, the same window of four significant unknown events of the present utterance is compared to the stored reference event data No. 2, as indicated by reference numeral 190 in FIG. 22.

In the example of FIG. 22, the graph indicates that none of the four significant unknown events in the first window of the present utterance sufficiently closely matches stored reference event data No. 1 or No. 2, so the program applies the same data window to stored reference event data No. 3, as indicated by reference numeral 191. In this case, it is found by the comparison routine of FIG. 23 that unknown event data No. 2 matches stored reference event data No. 3. The X in data window 191 indicates this matching. Then, the comparison routine of FIG. 23 repeats its comparison operations for another unknown events window 192, which begins immediately after the X in window 191.

If a predetermined number, for example six, of "mismatches" such as occurred in windows 189 and 190 occur, then the comparison routine in FIG. 23 determines that the present utterance cannot be recognized and does not match the stored reference event of 187 in FIG. 22. The routine then goes to the next stored reference word and repeats the above procedure. What this technique accomplishes is that it identifies matching features between the spoken command and the previously stored command, despite "absence" or "addition" of extraneous features that may occur each time the word is spoken. The width of the window such as 189 and 190 allows effective "synchronizing" of the comparison of the sequence of features in the presently spoken command to those in the previously stored reference command, despite these extraneous differences in the significant features of the same word when it is spoken at different times by the same speaker.

Referring next to the flow diagram of FIG. 23, which corresponds to the printout of Appendix C attached hereto, the above-mentioned significant event comparison subroutine is entered via label 200, and goes to block 201. In block 201, two variables called SCORE and STRIKE are each set to zero. The routine then goes to block 202 and sets an utterance event pointer UT equal to 1. The routine then goes to block 203 and loops through the previously stored reference event numbers designated by reference numeral 187 in FIG. 22, incrementing a reference pointer RR, which indicates at which matching unknown utterance event to begin the next window. UU in blocks 205 and 206 points to unknown utterance events 188.

The routine then goes to block 204 and sets the value of four variables SUBSCORE(1) . . . SUBSCORE(4) to zero. The latter four variables correspond to "scores" which are computed for each of the four events in each of the event windows in FIG. 22.

The significant event comparison routine then goes to block 205 and loops through the present utterance event window, and, for each comparison of an utterance event in a particular window with the corresponding previously stored reference event, obtains a "difference number", assigns to it a "weighted" score, and sets the appropriate one of the variables SUBSCORE(1), etc. to that score, based on the difference between the utterance event and the stored reference event.

After the four subscore variables have values assigned to them, the routine then goes to decision block 207 and determines if there are more events in the present window. If this determination is affirmative, the routine goes back to block 205 and repeats. If the determination is negative, the routine goes to block 208 and selects the smallest of the four subscores. The routine then goes to decision block 209 and determines if the value of the smallest subscore variable is zero. If this is the case, it indicates a perfect match of the corresponding unknown utterance event with the present reference event, and the routine then goes to block 210.

In block 210, the routine sets the UT variable, which points to the first event in the window of the unknown utterance event, to point to the next event after the perfect match. The routine then goes to block 213 and sets the variable STRIKE equal to zero, and also diminishes the value of the variable SCORE by a suitable amount to "reward" the significant event comparison routine for locating a perfect match between an unknown utterance event in the present window and the present stored reference event.

In the event that the determination of decision block 209 was negative, the routine enters block 211 and increments the variable STRIKE. The routine then goes to block 212 and adds the total amount of mismatch of each of the four utterance events in the present window to the cumulative value of SCORE.

The routine then goes to decision block 214 and determines if STRIKE exceeds 5, or another suitable empirically determined number. If this determination is affirmative, it is assumed that the present unknown utterance does not match the present stored reference utterance, and the routine is exited via block 215.

If the determination of decision block 214 is negative, the routine goes to block 216 and determines if all of the significant events of the present utterance have been compared to the stored reference event data for the present stored reference utterance. If this determination is affirmative, the routine goes to block 217, and sets STRIKE back to zero, and also suitably diminishes or "rewards" the variable SCORE. The routine then is exited via label 219. The routine then returns to label 200 and repeats the algorithm of FIG. 23 for additional stored reference utterances to determine if any of them match the present utterance better than the just completed analysis.

If the decision of decision block 216 is negative, the routine enters decision block 218 and determines if all significant events of the present utterance have been tested against stored reference event data for the present stored reference utterance. If this determination is negative, the routine goes back to block 203, but otherwise goes to block 217 and then exits via label 219.

The "rewards" used in block 217 are required in case some other word in the lexicon has a similar or identical sequence in the early part of matching, but exhibits a discrepant latter part. Any word which finishes the algorithm of FIG. 23 therefore has a large reduction in the accumulated score (such as a reduction of one half) in order to portray or encourage a favorable final decision when the scores of all matched words are evaluated.
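
The comparison routine of FIG. 23 can be illustrated by the following minimal sketch in Python. It is not the listing of Appendix C. Events are modeled as plain integers, and the names compare_events, weight, reward and finish_fraction, together with their default values, are assumptions made only for illustration. The final reduction is applied to the absolute value of the score so that it always lowers the score, which is also an assumption.

```python
from typing import Optional, Sequence


def compare_events(
    unknown: Sequence[int],
    reference: Sequence[int],
    window: int = 4,               # events per unknown-event window (FIG. 22)
    max_strikes: int = 5,          # block 214 threshold
    weight: int = 1,               # hypothetical weighting of the difference number
    reward: int = 1,               # hypothetical reward for a perfect match
    finish_fraction: float = 0.5,  # hypothetical "one half" final reduction
) -> Optional[float]:
    """Return an accumulated score (lower is better), or None if rejected."""
    score = 0.0   # SCORE, block 201
    strike = 0    # STRIKE, block 201
    ut = 0        # UT: first event of the current window (0-based here)

    for ref_event in reference:            # block 203: loop over reference events
        events = unknown[ut:ut + window]   # current unknown-event window
        if not events:
            break                          # all unknown events have been used

        # Blocks 204-207: one weighted subscore per event in the window.
        subscores = [weight * abs(u - ref_event) for u in events]
        best = min(subscores)              # block 208

        if best == 0:                      # block 209: perfect match
            ut += subscores.index(best) + 1  # block 210: window starts after the match
            strike = 0                     # block 213
            score -= reward
        else:
            strike += 1                    # block 211
            score += sum(subscores)        # block 212: add the total mismatch
            if strike > max_strikes:       # block 214
                return None                # block 215: no match for this reference

    # Block 217: a word that finishes the routine receives a large reduction.
    return score - finish_fraction * abs(score)


print(compare_events([7, 3, 9, 5, 8], [3, 5, 8]))
print(compare_events([1] * 8, [9] * 7))
```

In this sketch a mismatch leaves the window where it is and moves on to the next reference event, while a perfect match moves the window to start just after the matching unknown event, mirroring windows 189 through 192 of FIG. 22. A caller would run the function once per stored reference word and keep the word with the lowest returned score.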

FIG. 24 illustrates another implementation of the system shown in FIG. 1, wherein blocks 5, 6, 8, 13, 15, and 18 are omitted, and their functions are performed or accomplished by means of a high speed analog to digital converter of the type referred to in the art as a "flash analog to digital converter", and by means of two digital filter programs. More specifically, in FIG. 24, the system 285 includes a microphone 289 and a microphone amplifier 290, the output of which is applied as an analog input to a microcomputer 286. Microcomputer 286 includes a processor 292, which can be implemented by means of any of a large number of presently commercially available microprocessors. It is coupled to a digital bus 293, which can include the bus 21 in FIG. 1 on which phoneme identification signals are provided, or it can simply provide a sampled, compact digital representation of information of the type conducted on cable 182 in the embodiment of FIG. 19.

Microcomputer 286 includes a flash analog to digital converter 288, which can be implemented by means of several commercially available analog to digital converters that are capable of performing conversions in roughly 30 microseconds or less. Microprocessor 292 executes two digital filter routines, which are schematically indicated in FIG. 24 by block 287. Dotted line 296 schematically represents the execution of the two digital filter routines 287 by microprocessor 292. Reference numeral 282 represents an RC time constant circuit which is coupled by conductor 283 to digital circuitry within microcomputer 286 which must be provided and controlled by microprocessor 292 in order to execute the two digital filter routines. One of the two digital filter routines and associated digital filter circuitry performs the function of digitally filtering the voice band components of the analog signal on conductor 291. It performs the function of the audio band pass filter 5 in FIG. 1. Reference numeral 284 represents an RC time constant circuit which is coupled by conductor 295 to additional digital filtering circuitry which is controlled by microprocessor 292 in the course of executing the second digital filter routine in block 287, to provide a properly filtered pitch band signal. This circuitry and digital filter routine perform the function of the pitch band pass filter 6 in FIG. 1.

It is within the present capability of the art (although it is believed not yet to have been done) to integrate a flash analog to digital converter such as 288 and a microprocessor such as 292 and additional digital filtering circuitry onto a single integrated circuit chip. This would be a straightforward, economical method of implementing large portions of the system of FIG. 1.

Microprocessor 292 could easily execute a subroutine which would look at the outputs of the flash analog to digital converter 288 and could easily perform a comparison of each point of the output with the adjacent points thereof to detect the peaks of the analog speech waveform produced on conductor 291. It would be a straightforward matter for the microprocessor to then mathematically determine the major inflection points and compute all of the various running averages, rates of change, and acceleration variables that have been previously described herein. The undesired "intervening" numbers produced by analog to digital converter 288 could be discarded, to achieve the same degree of significant event data compaction previously described herein.
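
The adjacent-point comparison described above can be sketched as follows in Python. This is an illustration only: the function name find_peaks and the sample values are hypothetical, only local maxima are detected, and the inflection point, running average, rate of change and acceleration computations are not shown.

```python
from typing import List, Sequence, Tuple


def find_peaks(samples: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (index, value) pairs for local maxima of a sampled waveform.

    A sample counts as a peak when it is greater than the preceding sample
    and not less than the following one, so a flat top is reported once.
    """
    peaks = []
    for i in range(1, len(samples) - 1):
        if samples[i - 1] < samples[i] >= samples[i + 1]:
            peaks.append((i, samples[i]))
    return peaks


# Hypothetical converter output; every sample that is not a peak is discarded.
converter_output = [0, 2, 5, 3, 1, 4, 4, 2, 0]
print(find_peaks(converter_output))
```

Keeping only the peak positions and amplitudes, and discarding the intervening samples, is one simple way to obtain the kind of data compaction referred to above.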

While the invention has been described with reference to a particular embodiment thereof, those skilled in the art will be able to make various modifications to the disclosed embodiment without departing from the true spirit and scope thereof. It is intended that all elements and steps which perform substantially the same function in substantially the same manner to obtain substantially the same result are within the scope of the present invention.

I claim:
 1. A method of transforming a sound signal having positive pressure wave portions and negative pressure wave portions into character signals, said method comprising the steps of: (a) producing an analog signal representing said sound signal; (b) producing a binary signal having a "1" level during positive pressure wave portions of said sound signal and a "0" level during negative pressure wave portions of said sound signal; (c) detecting the peak amplitude of each pitch cycle of said analog signal and producing a pitch cycle marker signal synchronized to a maximum peak amplitude portion of said analog signal; (d) producing a first number that represents the duration of a "1" level of said binary signal most closely following a pulse of said pitch cycle marker signal and producing a second number that represents the duration of the next most closely following "1" level of said binary signal; (e) producing a character vector having a magnitude and a direction from said first and second numbers and comparing said character vector with a plurality of stored vectors to determine if said character vector matches any of said stored vectors; and (f) producing a character signal representing a sound corresponding to a one of said stored vectors that matches said character vector.
 2. The method of claim 1 including producing a running average of the durations of said "1" levels and said "0" levels of a plurality of said pitch cycles of said analog signal and also producing a summation of the durations of said "1" and "0" levels of one pitch cycle, dividing said summation by the sum of said running averages of said "1" and "0" durations, and, if there is a remainder, using said remainder to obtain a correction factor to effectuate a correction in said character vector.
 3. The method of claim 2 including applying said correction factor to said durations of one of said "1" levels.
 4. The method of claim 2 wherein said using said correction factor effectuates correcting of said character vector to a value which said character vector would have if the pitch period of said one pitch cycle were equal to the said sum of said running averages of said "1" levels and said "0" levels.
 5. The method of claim 1 including comparing durations of succeeding initial "1" levels to durations of later occurring "1" levels to determine the change or rate of change of said durations of said "1" levels, and, if said change or rate of change falls within a predetermined range of values, producing a character demarcation signal representing a beginning or end of a character sound represented by said sound signal.
 6. The method of claim 3 wherein said correction factor is obtained by using said remainder to address a look-up table to obtain said correction factor.
 7. The method of claim 3 wherein said sound signal includes a voiced sound portion, and wherein steps (d), (e) and (f) apply to said voiced sound portion.
 8. The method of claim 7 wherein said sound signal includes a fricative portion, said method including computing a first running average of said first numbers and a second running average of said second numbers, and computing a fricative character vector between said first and second running averages, and comparing said fricative character vector with a plurality of stored fricative vectors to determine if said fricative character vector matches any of the stored fricative vectors, and producing a character signal representing a fricative corresponding to a one of said stored fricative vectors that matches said fricative character vector.
 9. The method of claim 7 wherein said voiced sound portion includes a plosive portion and a silent portion immediately preceding said plosive portion, said method including the steps of measuring the duration of said silent interval and comparing that duration with a plurality of stored durations, and, if the compared duration matches any of the stored durations, producing a signal that is at least partly indicative of the type of the plosive portion.
 10. The method of claim 1 wherein said character signals each represent a different phoneme contained in said sound signal.
 11. A system for transforming a sound signal having positive pressure wave portions and negative pressure wave portions into character signals, said system comprising: (a) means for producing an analog signal representing said sound signal; (b) means responsive to said analog signal producing means for producing a binary signal having a "1" level during positive pressure wave portions of said sound signal and a "0" level during negative pressure wave portions of said sound signal; (c) means responsive to said analog signal producing means for detecting the peak amplitude of each pitch cycle of said analog signal and producing a corresponding pitch cycle marker signal synchronized to a maximum peak amplitude portion of said analog signal; (d) means responsive to said binary signal for producing a first number that represents the duration of a "1" level of said binary signal most closely following a pulse of said pitch cycle marker signal and producing a second number that represents the duration of the next most closely following "1" level of said binary signal; (e) means for producing a character vector having a magnitude and a direction from said first and second numbers and comparing said character vector with a plurality of stored vectors to determine if said character vector matches any of said stored vectors; and (f) means for producing a character signal representing a sound corresponding to a one of said stored vectors that matches said character vector.
 12. The system of claim 11 including means for producing a running average of the durations of said "1" levels and said "0" levels of a plurality of said pitch cycles of said analog signal, producing a summation of the durations of said "1" and "0" levels of one pitch cycle, for dividing said summation by the sum of said running averages of said "1" and "0" durations, and means for using said remainder to obtain a correction factor to effectuate a correction in said character vector if there is a remainder.
 13. The system of claim 12 including means for applying said correction factor to said durations of one of said "1" levels.
 14. The system of claim 12 wherein said means for using said correction factor effectuates correcting of said character vector to a value which said character vector would have if the pitch period of said one pitch cycle were equal to said sum of said running averages of said "1" levels and said "0" levels.
 15. The system of claim 11 including means for comparing durations of succeeding initial "1" levels to durations of later occurring "1" levels to determine the change or rate of change of said durations of said "1" levels, and means for producing a character demarcation signal representing a beginning or end of a character sound represented by said sound signal if said change or rate of change falls within a predetermined range of values.
 16. The system of claim 13 including means for obtaining said correction factor by using said remainder to address a look-up table.
 17. The system of claim 13 wherein said sound signal includes a voiced sound portion.
 18. The system of claim 17 wherein said sound signal includes a fricative portion, said system including means for computing a first running average of said first numbers and a second running average of said second numbers, and means for computing a fricative character vector between said first and second running averages, and means for comparing said fricative character vector with a plurality of stored fricative vectors to determine if said fricative character vector matches any of the stored fricative vectors, and means for producing a character signal representing a fricative corresponding to a one of said stored fricative vectors that matches said fricative character vector.
 19. The system of claim 17 wherein said voiced sound portion includes a plosive portion and a silent portion immediately preceding said plosive portion, said system including means for measuring the duration of said silent interval and means for comparing that duration with a plurality of stored durations, and means for producing a signal that is at least partly indicative of the type of the plosive portion if the compared duration matches any of the stored durations.
 20. The system of claim 11 wherein said character signals each represent a different phoneme contained in said sound signal.
 21. A method of transforming positive pressure waves and negative pressure waves from a sound signal into phoneme identification signals, said method comprising the steps of: (a) producing an analog signal representative of said sound signal; (b) producing a first bandwidth-limited analog signal representing said sound signal in response to said analog signal; (c) producing a binary signal having a "1" level during positive pressure wave portions of said sound signal and a "0" level during negative pressure wave portions of said sound signal; (d) concurrently, producing, in response to said analog signal, a second bandwidth-limited analog sound signal representing low frequency components of said analog signal for detecting the peak amplitude of each pitch cycle of said analog signal and producing a pitch cycle marker signal synchronized to a maximum peak amplitude portion of said analog signal; (e) producing a first number that represents the duration of a "1" level of said binary signal most closely following the onset of said pitch cycle marker signal and producing a second number that represents the duration of the next most closely following "1" level of said binary signal; (f) producing a character vector having a magnitude and a direction from said first and second time duration numbers and comparing said character vector with a stored map composed of a plurality of vector boundary regions to determine which of the said vector boundary regions contains said character vector; and (g) producing a phoneme identification signal representing a sound corresponding to a one of said stored vector regions that matches said character vector.
 22. The method of claim 21 including producing a running average of the durations of all successive ones of said "1" levels of said binary signal and a running average of all successive ones of said "0" levels of said binary signal and also producing a summation of the durations of said "1" and "0" levels within each pitch cycle to represent the instantaneous pitch time period, dividing said summation by the sum of said running averages of said "1" and "0" durations, and, if there is a remainder, using said remainder to obtain a correction factor to be applied to one or both of said first and second time duration numbers or to said character vector.
 23. The method of claim 22 including applying said correction factor prior to said producing of said character vector and prior to said comparing with the stored map of vector boundary regions.
 24. The method of claim 23 wherein said applying of said correction factor effectuates correcting of said character vector to a value which said character vector would have if the pitch period of said one pitch cycle were equal to an exact integer multiple of said running averages of said "1" levels and said "0" levels.
 25. The method of claim 21 including comparing durations of succeeding initial "1" levels to durations of the later occurring "1" levels to determine the change or rate of change of said durations of said "1" levels, and, if said change or rate of change falls within a predetermined range of values, producing a demarcation signal representing a beginning or end of a relatively stable phoneme.
 26. The method of claim 23 wherein said correction factor is obtained by using said remainder to address a look-up table to obtain said correction factor.
 27. The method of claim 22 wherein said sound signal includes a voiced sound portion, and wherein steps (d), (e) and (f) apply to said voiced sound portion.
 28. The method of claim 27 wherein said sound signal includes a fricative portion, said method including computing a first running average of said first numbers and a second running average of said second numbers, and forming a fricative vector between said first and second running averages, and comparing said fricative vector with a stored map composed of a plurality of vector boundary regions to determine if said fricative vector falls within any of the stored fricative vector boundary regions, and producing a phoneme or sound identifying signal representing a fricative corresponding to a one of said stored fricative vector regions.
 29. The method of claim 27 wherein said voiced sound portion includes a plosive portion and a silent portion immediately preceding said plosive portion, said method including the steps of measuring the duration of said silent portion and comparing that duration with a plurality of stored durations and producing a signal useful in determining to which of a plurality of plosive phonemes said plosive portion corresponds.
 30. The method of claim 27 wherein if either said change or said rate of change exceeds a predetermined value, a velocity slope vector is formed to determine a trajectory useful in predicting from what vector boundary region the voiced character vector came from and/or to which vector boundary region the voiced character vector is going in order to assist in identifying very short duration phonemes.
 31. A system for transforming positive pressure waves and negative pressure waves from a sound signal into phoneme identification signals, said system comprising: (a) means responsive to said sound signal for producing an analog signal representative of said analog signal; (b) means responsive to said analog signal for producing a first bandwidth-limited analog signal representing said sound signal in response to said analog signal; (c) means responsive to said bandwidth-limited analog signal for producing a binary signal having a "1" level during positive pressure wave portions of said sound signal and a "0" level during negative pressure wave portions of said sound signal; (d) means responsive to said analog signal for producing a second bandwidth-limited analog sound signal representing low frequency components of said analog signal; (e) means responsive to said second bandwidth-limited analog signal for detecting the peak amplitude of each pitch cycle of said analog signal and producing a pitch cycle marker signal synchronized to a maximum peak amplitude portion of said analog signal; (f) means responsive to said binary signal for producing a first number that represents the duration of a "1" level of said binary signal most closely following the onset of said pitch cycle signal and producing a second number that represents the duration of the next most closely following "1" level of said binary signal; (g) means for producing a character vector from said first and second time duration numbers and means for comparing said character vector with a stored map composed of a plurality of vector boundary regions to determine which of the said vector boundary regions contains said character vector; and (h) means for producing a phoneme identification system representing a sound corresponding to a one of said stored vector regions that matches said character vector.
 32. The system of claim 31 including means for producing a running average of the durations of all successive ones of said "1" levels of said binary signal and a running average of all successive ones of said "0" levels of said binary signal, means for producing a summation of the durations of said "1" and "0" levels within each pitch cycle to represent the instantaneous pitch time period, means for dividing said summation by the sum of said running averages of said "1" and "0" durations, and means for using said remainder to obtain a correction factor to be applied to one or both of said first and second duration numbers or to said character vector if there is a remainder.
 33. The system of claim 32 including means for applying said correction factor prior to said producing of said character vector and prior to said comparing with the stored map of vector boundary regions.
 34. The system of claim 33 wherein said means for applying said correction factor effectuates correcting of said character vector to a value which said character vector would have if the pitch period of said one pitch cycle were equal to an exact integer multiple of said running averages of said "1" levels and said "0" levels.
 35. The system of claim 31 including means for comparing durations of succeeding initial "1" levels to durations of the later occurring "1" levels to determine the change or rate of change of said durations of said "1" levels, and means for producing a demarcation signal representing a beginning or end of a relatively stable phoneme if said change or rate of change falls within a predetermined range of values.
 36. The system of claim 33 wherein said correction factor is obtained by using said remainder to address a look-up table to obtain said correction factor.
 37. The system of claim 32 wherein said sound signal includes a voiced sound portion.
 38. The system of claim 37 wherein said sound signal includes a fricative portion, said system including means for computing a first running average of said first numbers and a second running average of said second numbers, means for forming a fricative vector between said first and second running averages, means for comparing said fricative vector with a stored map composed of a plurality of vector boundary regions to determine if said fricative vector falls within any of the stored fricative vector boundary regions, and means for producing a phoneme or sound identifying signal representing a fricative corresponding to a one of said stored fricative vector regions that said character vector matches.
 39. The system of claim 35 wherein said voiced sound portion includes a plosive portion and a silent portion immediately preceding said plosive portion, said system including means for measuring the duration of said silent portion, means for comparing that duration with a plurality of stored durations, and means for producing a signal useful in determining which of a plurality of plosive phonemes said plosive portion corresponds to.
 40. The system of claim 35 means for producing a velocity slope vector to determine a trajectory useful in predicting from which vector boundary region the voiced character vector came from and/or to which vector boundary region the voiced character vector is going in order to assist in identifying very short duration phonemes if either said changes or said rate of change exceeds a predetermined value.